Topics, Events, and Stories in Social Media

Ting Hua

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science and Applications

Chang-Tien Lu, Chair
Naren Ramakrishnan
Ing-Ray Chen
Chandan K. Reddy
Zhenhui Jessie Li

December 15, 2017
Falls Church, Virginia

Keywords: Social media, Topic modeling, Event detection

Copyright 2017, Ting Hua

Topics, Events, and Stories in Social Media

Ting Hua

(ABSTRACT)

This thesis focuses on developing methods for social media analysis. Specifically, five directions are proposed: 1) semi-supervised detection of targeted-domain events, 2) study of topical interactions among multiple datasets, 3) discriminative learning to identify common and distinctive topics, 4) epidemic modeling for flu forecasting that seeds simulation with signals from social media data, and 5) storyline generation for massive unorganized document collections. For the first direction, existing solutions in spatiotemporal event detection are mostly supervised approaches that require expensive human labeling effort. The contributions of the proposed work include: (1) Developed a semi-supervised framework, (2) Designed a novel label generation method, and (3) Proposed an innovative multinomial spatial-scan algorithm. For the second direction, most traditional solutions in topic modeling are designed to analyze formal documents such as news reports, and cannot handle noisy social media data efficiently and effectively. The contributions of the proposed work for this task include: (1) Proposed a novel generative model that jointly considers Twitter and news data in one unified framework, (2) Designed an effective algorithm for model parameter inference, and (3) Explored real-world applications that utilize the outputs of the proposed model. Discriminative learning is the basis for comparative thinking; however, most related previous studies only work in scenarios involving two datasets. The third proposed work contributes in the following aspects: (1) Proposed a Bayesian model to identify common and distinctive topics across multiple datasets, (2) Developed efficient parameter inference algorithms based on Gibbs sampling, and (3) Evaluated the proposed model on various datasets in comparison with important baselines.
Existing work on epidemics modeling either cannot guarantee the timeliness of disease surveillance or cannot accurately characterize the underlying mechanism of flu spreading. The contributions of the fourth task include: (1) Proposed a novel integrated framework combining computational epidemiology and social media mining, (2) Designed an effective algorithm for model parameter inference, and (3) Compared the proposed method with important baselines on various datasets. In the field of storyline generation, traditional solutions cannot clearly represent the underlying structure of related events, and most of them require human-recognized labels as inputs. The contributions of this work include: (1) Proposed a generative framework for storyline detection, (2) Developed efficient parameter inference algorithms, and (3) Utilized the proposed model to analyze real-world cases.

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, IARPA, DoI/NBC, or the US Government.

Topics, Events, and Stories in Social Media

Ting Hua

(GENERAL AUDIENCE ABSTRACT)

The rise of “big data”, especially social media data (e.g., Twitter, Facebook, YouTube), offers new opportunities for understanding human behavior. Novel computing methods for mining patterns in social media data are therefore desired. By applying such approaches, it has become possible to aggregate publicly available data to capture the triggers underlying events, detect on-going trends, and forecast future happenings. This dissertation provides comprehensive studies in social media data analysis. The goals of the dissertation include early event detection, future event prediction, and event chain organization. Specifically, these goals are achieved through efforts in the following aspects: (1) semi-supervised and unsupervised methods are developed to collect early signals from social media data and detect on-going events; (2) graphical models are proposed to model the interactions and comparisons among multiple datasets; (3) traditional computational methods are combined with newly emerging social media data analysis for fast epidemic prediction; (4) events at different time stamps are organized into event chains via novel probabilistic models. The effectiveness of our approaches is evaluated using various datasets, such as Twitter posts and news articles. In addition, interesting case studies are provided to show the models’ abilities in real-world exploration.

To my mother, for her love, support, and spirit.

Acknowledgments

I am grateful to the many friends and colleagues who supported my Ph.D. study. First and foremost, I express my deepest gratitude to my advisor and mentor, Dr. Chang-Tien Lu, for his advice and support. Dr. Lu is the best advisor a Ph.D. student could hope for. His advising combines intelligence, patience, and support, and his guidance and wisdom made my Ph.D. both interesting and productive. Dr. Lu can quickly understand the details of my work and capture its key values. He also helped me greatly in making my presentations clear, simple, and easy to follow. Dr. Lu is a hard-working man who spent most of his time in the lab. I will leave VT with his valuable advice and admirable qualities, which will continue to benefit me both in life and in research. I am also deeply thankful to Dr. Naren Ramakrishnan, who continuously gave me helpful suggestions on research directions throughout these years. He broadened my research views, and I always felt inspired during our conversations. I want to thank Dr. Ing-Ray Chen for his great advice and for holding me to a high standard of presentation during our collaboration. His efforts turned initial class proposals into great research publications. I want to thank Dr. Chandan K. Reddy for his help, advice, and guidance on multiple projects. He is a knowledgeable professor who can always point me to the most up-to-date research. Thank you to Dr. Zhenhui Li; your insightful feedback and comments always gave me new perspectives from which to rethink my work. Also, I need to thank Johnny Cash, Bob Dylan, and Nirvana; Nietzsche and Camus; Marguerite Duras, Gabriel Garcia Marquez, and Jorge Luis Borges. With them, this dissertation was delayed by at least a year; without them, it could never have been started or finished.

Contents

1 Introduction 1
1.1 Research Issues ...... 3
1.1.1 Twitter Event Detection ...... 3
1.1.2 Underlying Factors behind Social Media and News ...... 4
1.1.3 Learning Common and Distinctive Topics from Multiple Datasets ...... 4
1.1.4 Seeding Simulation with Updates from Social Media Data ...... 4
1.1.5 Storyline Generation using Social Media ...... 5
1.2 Goals and Contributions ...... 5
1.3 Organization ...... 8

2 Twitter Event Detection 9
2.1 STED: Semi-Supervised Targeted Event Detection ...... 9
2.1.1 Introduction ...... 9
2.1.2 Framework and Methods ...... 10
Automatic Label Creation and Expansion ...... 11
Twitter Text Classification ...... 11
Location Estimation ...... 13
2.1.3 Demonstration ...... 14
2.1.4 Conclusion ...... 16
2.2 Automatic Targeted-Domain Spatiotemporal Event Detection in Twitter ...... 16
2.2.1 Introduction ...... 16

2.2.2 Related Work ...... 19
Event detection in newswire documents ...... 19
General-domain event detection in Twitter ...... 20
Targeted-domain event detection in Twitter ...... 20
Distant supervision and transfer learning ...... 21
2.2.3 Framework and Problem Formulation ...... 22
Framework ...... 22
Problem Formulation ...... 23
2.2.4 Automatic Label Generation ...... 25
Feature Extraction ...... 25
Relevancy Ranking ...... 26
Textual Similarity ...... 26
Spatial Similarity ...... 27
Temporal Similarity ...... 27
Label Refinement ...... 28
2.2.5 Spatiotemporal Event Detection ...... 29
Tweet Classifier ...... 29
Event Location Estimation ...... 31
2.2.6 Results ...... 33
Datasets and evaluation metrics ...... 33
Methods for Comparison ...... 35
Parameter settings ...... 36
Performance Analysis ...... 37
Overall Relevance Evaluation ...... 37
Evaluation of the Tweet Classifier ...... 42
Case Study ...... 46
2.2.7 Conclusion ...... 46

3 Underlying Factors behind Social Media and News 47

3.1 Analyzing Civil Unrest through Social Media ...... 47
3.1.1 Introduction ...... 47
3.1.2 Event-related Tweet Extraction ...... 49
3.1.3 Identifying Contributing Factors ...... 51
3.1.4 Event Evolution Analysis ...... 53
3.1.5 Conclusion ...... 55
3.2 Topical Analysis of Interactions Between News and Social Media ...... 55
3.2.1 Introduction ...... 55
3.2.2 Related Work ...... 58
Topic Modeling on Short Texts ...... 58
Transfer Knowledge in Multiple Datasets ...... 59
Mining Time Series and Topic Evolution ...... 59
3.2.3 Problem Statement and Model ...... 59
3.2.4 Problem Statement ...... 59
Model ...... 61
3.2.5 Inference via Gibbs Sampling ...... 63
3.2.6 Discovery for topic lags and influence ...... 67
Topic distribution differences ...... 67
Topic temporal patterns ...... 68
Topic influence ...... 68
Key news reports and tweets ...... 68
3.2.7 Experiment ...... 69
Dataset ...... 69
Results of modeling performance ...... 70
Results of topic evolution discovery ...... 72
3.3 Conclusion ...... 77

4 A Probabilistic Model for Discovering Common and Distinctive Topics from Multiple Datasets 78

4.1 Introduction ...... 78
4.2 Related Work ...... 81
4.2.1 Traditional Topic Models ...... 81
4.2.2 Discriminative Topic Modeling ...... 81
4.2.3 Global and Local Aspects Mining ...... 82
4.3 Proposed Method ...... 82
4.3.1 Problem Statement ...... 82
4.3.2 Model Definition ...... 82
4.4 Inference ...... 86
4.4.1 Joint distribution ...... 86
4.4.2 Hidden Variables ...... 86
4.4.3 Multinomial Parameters ...... 87
4.4.4 Gibbs sampling algorithm ...... 88
4.5 Experiments ...... 88
4.5.1 Datasets and Experiment Settings ...... 90
4.5.2 Comparison methods and validation metrics ...... 91
4.5.3 Quantitative Performance ...... 92
Parameter Sensitivity Analysis ...... 92
Clustering Performance ...... 93
4.5.4 Topic Distributions ...... 94
4.5.5 Topic Discovery on Multiple Collections ...... 97
4.6 Conclusion ...... 98

5 Social Media based Simulation Models for Understanding Disease Dynamics 101
5.1 Introduction ...... 101
5.2 Related Work ...... 104
5.3 The Proposed SMS Model ...... 105
5.3.1 Learning in Social Media Space ...... 105
5.3.2 Learning in Simulation Space ...... 109

5.3.3 Interaction between two spaces ...... 109
5.4 Model Inference ...... 110
5.5 Experimental Results ...... 112
5.5.1 Datasets ...... 112
5.5.2 Labels and Evaluation Metrics ...... 114
5.5.3 Comparison Methods ...... 115
5.5.4 Results ...... 115
Performance on Pearson correlation ...... 115
Performance on mean squared error and peak-time error ...... 117
5.6 Conclusion ...... 117

6 Automatic Storyline Generation with Help from Twitter 119
6.1 Introduction ...... 119
6.2 Related Work ...... 121
6.3 Model ...... 122
6.4 Model Inference and Learning ...... 125
6.4.1 Model Inference ...... 125
6.4.2 Learning Operations ...... 126
6.5 Experiment ...... 127
6.5.1 Datasets and Experiment Settings ...... 127
6.5.2 Experiment Results ...... 128
6.6 Conclusion ...... 130

7 Completed Work and Future Work 132
7.1 Research Tasks ...... 133
7.1.1 Targeted-domain Twitter Event Detection ...... 133
7.1.2 News & Social Media Influence and Interaction ...... 133
7.1.3 Storyline Generation via Help from Social Media ...... 134
7.1.4 Epidemic Simulation with Updates from Social Media ...... 134

7.1.5 Learning Common and Distinctive Topics from Multiple Datasets ...... 134
7.2 Schedule ...... 135
7.3 Publications and Submissions ...... 136
7.3.1 Current Publications ...... 136
7.3.2 Submitted and In-preparation Papers ...... 137

8 References 138

List of Figures

2.1 System Framework of STED ...... 10
2.2 Tweets’ Social Ties Networks. Big nodes represent terms: red nodes are hashtags, blue nodes are mentions, and yellow nodes are retweets. Small nodes denote tweets: blue ones are labeled tweets, orange nodes are newly found tweets from raw data. Edge (i, t) means tweet t contains term i. ...... 12
2.3 Example of Tweet Location Clusters. Red nodes denote the highest density of tweets of locations. ...... 14
2.4 Interface of STED system ...... 15
2.5 Historical Analysis Screenshot ...... 15
2.8 ATSED system architecture ...... 22
2.9 Example of tweet-tie heterogeneous graph. Big nodes represent social ties: red nodes are hashtags, blue nodes are mentions, and yellow nodes are retweets. Small nodes denote tweets. ...... 30
2.12 Temporal performance comparison of ATSED, Earthquake, and TEDAS ...... 38

3.1 News article from Milenio, a major Mexican newspaper, about a protest in Mexico City calling for the release of captured wild dogs alleged to have attacked and killed citizens. Like most such articles, it includes basic facts such as the date of the protest (12 January 2013), its location (El Zocalo, the city’s main public square), and the number of participants (150) but provides little insight into the incident’s underlying causes. The article, originally in Spanish, has been translated into English using Google Translate. ...... 48
3.2 Event-related tweet extraction pipeline. Red boxes indicate event words for the street dog liberation protest in Mexico City and yellow boxes denote topic keywords. The word cloud in the top left shows top-ranked topic keywords for Mexico, generated from a database of 2,141 protest events in that country from January 2011 to September 2013. The tweets, originally in Spanish, have been translated into English using Google Translate. ...... 50

3.3 Distribution of tweets related to the street dog liberation protest in Mexico City. Tweets spiked several days before the rally, which is typical of incidents of civil unrest. In contrast, tweets related to breaking news stories and major events such as natural disasters usually spike during the day of the event. The tweets, originally in Spanish, have been translated into English using Google Translate. ...... 52
3.4 Events leading up to the street dog liberation protest. The blue timeline indicates news reports, while the orange timeline denotes event-related tweets. On the blue line, triangles represent dates with emerging news, and squares are regular dates without emerging news. On the orange line, the size of the circle indicates the relative number of related tweets on the corresponding date. The original tweets, in Spanish, have been translated into English using Google Translate. ...... 54
3.5 An example of daily volume and topics on a particular theme in News data (top) vs Tweets data (bottom). Along the timeline (x-axis), the shaded areas represent the numeric values of raw document volume for news articles and tweets; the red and blue curves are hidden topics discovered by our NTIT model. ...... 57
3.6 NTIT graphical model ...... 62
3.7 Perplexity Comparison for News and Tweets Datasets ...... 71

4.1 Topic summaries for news articles published in October 2016 related to the US presidential election ...... 79
4.2 Topic summaries for NIPS papers from 1987 to 2013 ...... 79
4.3 Framework of CDTM model ...... 84
4.4 Performance comparison in terms of perplexity ...... 99
4.5 Case study of gun shooting in United States ...... 100

6.1 An example of the storyline-event-topic hierarchical structure of ASG ...... 120
6.3 Relations among storyline, event type, and topics. The triangles are symbols for storylines, the circles denote event types, and the squares indicate topics. ...... 131

List of Tables

2.1 Distribution of events in 10 Latin countries. “News source” shows the news agencies utilized as sources for the GSR dataset. ...... 34
2.2 Spatial performance comparison among Twitter event detection methods (Precision, Recall, F-score). Numbers in bold show the best F-score values in corresponding countries. ...... 34
2.3 Sample tweets for the baseline method and ATSED. Domain words are denoted by bold style and event words are marked with underlining. The tweets, originally in Spanish, have been translated into English using Google Translate. ...... 40
2.4 Labels quality evaluation through “Precision@K” ...... 42
2.5 Performance comparison for Twitter text classifiers (Precision, Recall, F-score). Upward arrows denote performance improvements over the original results shown in Table 2.2. Numbers in bold show the best F-score values for each country. ...... 43

3.1 Mathematical Notation ...... 60
3.2 Distribution of events and tweets across 5 Latin countries. “News source” indicates the news agencies utilized as sources for News dataset. ...... 69
3.3 Top words of top topics of NTIT and LDA ...... 72
3.4 Topic Influence. “Twitter %” is the ratio of topic in Twitter data, while “News %” is the ratio of topic in news data. “Degree” denotes the node degree for each topic, “In %” is the ratio of in-coming edges, and “Out %” is the proportion of out-going edges. ...... 74
3.5 Comparison of topic temporal patterns. “Pos %” denotes the ratio of peaks occurring earlier in Twitter than in news, “Neg %” implies that peaks appeared earlier in the news, and “Sim %” indicates the ratio of peaks that burst simultaneously in the two datasets. “Avg. Lag” indicates the average time lags between news and Twitter peaks, where positive values imply Twitter data come first while negative numbers denote the leading time of news data. ...... 75

3.6 Top 5 key news documents in “teacher protests” theme. Texts are translated from Spanish to English by Google Translate. ...... 76
3.7 Top 5 key tweets in “teacher protests” theme. Texts are translated from Spanish to English by Google Translate. ...... 76

4.1 Variable Notations ...... 83
4.2 Datasets ...... 90
4.3 The clustering performances achieved by NMF, LDA, discNMF, discLDA, and our proposed CDTM measured in terms of accuracy and NMI. Higher values indicate better performance. ...... 93
4.4 Word distributions for topics (10 most likely words) learned by the discNMF model and proposed CDTM model from the 4 area dataset. ...... 95

5.1 Mathematical Notation ...... 107

6.1 Detailed information of datasets ...... 128
6.2 Performance comparison among storyline detection methods (ACC, NMI) ...... 129
6.3 Example of background/storyline/topic words learned by ASG model ...... 130

7.1 Research tasks and status ...... 135

Chapter 1

Introduction

In the last decade, dramatic changes have taken place in the world of the Internet. One of these tremendous changes is the emergence of social media platforms such as Twitter and Facebook. Millions of active online users produce billions of social media posts daily [53]. Such rich real-time data bring opportunities for numerous applications in the field of data mining, as well as new challenges for traditional machine learning technologies. Taking Twitter as an example, the platform allows its users to post personal microblogs limited to 140 characters. On one hand, this scheme encourages users to update their posts frequently and therefore results in Twitter’s huge data volume. On the other hand, these short user-generated posts are much noisier than traditional formal documents (e.g., news reports) and thus pose difficulties for traditional machine learning techniques. Specifically, social media data have the following beneficial characteristics. 1) Up-to-date information. Social media such as Twitter are known to be much more prompt than traditional news media. For example, the breaking news of Michael Jackson’s death started to spread on Twitter only minutes after the event happened [45]. 2) Extensive topics. Almost any real-world event can be found in social media data, from the daily lives of ordinary people, recent sports games, and celebrity affairs to “serious matters” such as politics. 3) Diverse data types. Besides plain text, social media platforms usually support other data forms in posts (e.g., hashtags and friendships on Twitter), which enables the discovery of underlying relationships among content and users. These distinct features make social media an ideal data source for systems that track hot themes and forecast trending topics, for applications relying on parameters based on population-level observations, and for analyses that look into the factors driving events.
My research focuses on the mining of social media data, including early detection of on-going events from real-time Twitter streams, topical influence between social media and other data sources, discriminative learning to identify common and distinctive topics, event-chain generation with Twitter user-created labels, and disease diffusion modeling and simulation based on the monitoring of geo-aware tweets. This work applies to various applications, which are close to the classical research areas described as follows.


• Topic detection and tracking in social media. To detect and track emerging topics in social media streams, two research directions exist in this area: general-domain event detection and targeted-domain event detection. General-domain event detection usually applies unsupervised learning methods to track open-domain events. Lappas et al. [57] examined ways to discover terms that burst in geographical neighborhoods within a certain time period, considering content, structural, and temporal signals. Petrovic et al. [84] detected breaking news from Twitter data by building a nearest-neighbor tweet network and summarizing connected tweets into events. Supervised learning methods are commonly used in targeted-domain event detection. Typically, a classifier is trained via manually labeled data to identify tweets in the targeted domains, and clustering techniques are then applied to analyze the locations of the detected events. Sakaki et al. [98] first trained an SVM classifier to recognize tweets about earthquakes, and then built a Kalman filtering model to detect the geographic regions of these events. Through a decision-tree strategy, Popescu et al. [87] utilized targeted named entities to decide whether the corresponding snapshots indeed represent an event.

• Topic modeling. Latent Dirichlet Allocation (LDA) [10] has achieved great success in mining hidden topics in documents. Recently, with the development of online social media, there has been increasing interest in mining short texts using topic models. Some existing work looked into the problem of how to apply standard topic modeling approaches in social media environments. For example, Hong et al. [35] tested several schemes to train their LDA model on short messages, and concluded that document length has a significant impact on the performance of standard topic models. Yang et al. treated Twitter topic modeling as a multi-class multi-label classification problem [122], which can be solved using regularized logistic regression. Other previous work has applied variations of LDA to capture the latent topics in social media data. For example, Zhao et al. considered each tweet to be associated with only one topic [126], rather than a topic mixture. Vosecky et al. [112] extended LDA to include multiple facets that jointly model terms and entities (e.g., “person”, “organization”, and “location”). Lin et al. [67] used a “spike and slab” prior to deal with the sparsity problem of short texts, which allows documents to choose particular topics of interest.

• Transfer learning. Transfer learning techniques usually first extract knowledge from a source domain and then utilize the learned knowledge for tasks in the target domain [81]. Several approaches adopt transfer learning for Twitter text mining. Jin et al. [43] developed a variation of LDA to jointly learn topics from both short and long texts, where the knowledge shared by the two datasets is controlled by different settings of Dirichlet priors. Zhang et al. [124] first learned a latent semantic space from a source dataset, and then mapped the target dataset into that space for further mining tasks. Phan et al. [86] enriched Twitter data with hidden topics learned from external data sources such as Wikipedia and MEDLINE. That model is designed to find long texts related to given short texts; conversely, our work aims to extract short tweet labels from given long articles.

• Topic evolution. Topic evolution methods aim to identify hidden topics and track topical changes across timestamps. Most of the earlier work in this area, for example DTM [9], estimated the current topic distribution through parameters learned from the previous epoch. In addition to methods based on Markov assumptions, there has been some work modeling the evolution of topics using timestamps generated from a continuous distribution [116]. The TAM model [47] is a hybrid of these two approaches, capturing changes via a property dubbed “trend class”, a latent variable with distributions over topics, words, and time. Hong et al. [36] developed a novel method by integrating topic volume dynamics [60] with topics shared by multiple text streams. Tsytsarau et al. [109] addressed the problem of theme evolution by adding hidden variables that control volume evolution.

• Storyline generation. Storyline discovery is an emerging research direction. Shahaf et al. [100] proposed a metro-map-style story generation framework, which first detected community clusters in each time window and then grouped these communities into stories. Yan et al. tracked the evolution trajectory along the timeline by emphasizing relevance, coverage, coherence, and diversity of themes [121]. Mei et al. [73] proposed an HMM-style probabilistic method to discover and summarize the evolutionary patterns of themes in text streams. Lappas et al. [56] designed a term-burstiness model to discover the temporal trend of terms in news article streams. Taking user queries as input, Lin et al. [65] first extracted relevant tweets and then generated storylines through graph optimization. Lin et al. [63] built an HDP (Hierarchical Dirichlet Process) model for each time epoch and then selected sentences for the storyline by considering multiple aspects such as topic relevance and coherence. Huang et al. identified local and global aspects of documents and organized these components into a storyline via optimization [42], while Zhou et al. modeled storylines as distributions over topics and named entities [127].
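Several of the research areas above (topic modeling, transfer learning, topic evolution) rest on LDA-style models inferred with collapsed Gibbs sampling. The core sampling update can be sketched as follows on a toy corpus; this is a generic illustration of the technique, not the inference procedure of any particular cited model, and the toy documents and hyperparameter values are invented for the example.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for vanilla LDA on a toy corpus.
    Returns document-topic counts and topic-word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})  # vocabulary size
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    z = []  # topic assignment of every token, initialized at random
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            doc_topic[d][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the current assignment from the counts
                doc_topic[d][k] -= 1; topic_word[k][w] -= 1; topic_total[k] -= 1
                # full conditional: p(z=k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(doc_topic[d][t] + alpha) *
                           (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1; topic_word[k][w] += 1; topic_total[k] += 1
    return doc_topic, topic_word

docs = [["game", "goal", "team"], ["vote", "election", "party"],
        ["team", "goal", "match"], ["party", "vote", "senate"]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

The short-text extensions cited above modify exactly this conditional, e.g., by forcing one topic per tweet [126] or sharing Dirichlet priors across datasets [43].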

1.1 Research Issues

1.1.1 Twitter Event Detection

Twitter has become an important data source for detecting events, especially for tracking detailed information about events in a specific domain. Previous studies on targeted-domain Twitter information extraction have used supervised learning techniques to identify domain-related tweets. However, the need for extensive manual labeling makes these supervised systems extremely expensive to build and maintain. Moreover, most existing work fails to consider spatiotemporal factors, which are essential attributes of targeted-domain events. In this work, we propose a semi-supervised method for Automatic Targeted-domain Spatiotemporal Event Detection (ATSED) in Twitter. Given a targeted domain, ATSED first learns tweet labels from historical data, and then detects on-going events from real-time Twitter data streams. Specifically, an efficient label generation algorithm is proposed to automatically recognize tweet labels from domain-related news articles, a customized classifier is created for Twitter data analysis by utilizing tweets’ distinguishing features, and a novel multinomial spatial-scan model is provided to identify geographical locations for detected events. Experiments on 305 million tweets demonstrate the effectiveness of this new approach.

1.1.2 Underlying Factors behind Social Media and News

Mining and analyzing data from social networks such as Twitter can reveal new insights into the causes of civil disturbances, including trigger events and the role of political entrepreneurs and organizations in galvanizing public opinion. The analysis of interactions between social media and traditional news streams is becoming increasingly relevant for a variety of applications, including understanding the underlying factors that drive the evolution of data sources, tracking the triggers behind events, and discovering emerging trends. Researchers have explored such interactions by examining volume changes or information diffusion; however, most ignore the semantic and topical relationships between news and social media data. Our work is the first attempt to study how news influences social media, and vice versa, based on topical knowledge. We propose a hierarchical Bayesian model that jointly models news and social media topics and their interactions. We show that our proposed model can capture distinct topics for individual datasets as well as discover the topic influences among multiple datasets. By applying our model to large sets of news and tweets, we demonstrate its significant improvement over baseline methods and explore its power in discovering interesting patterns in real-world cases.

1.1.3 Learning Common and Distinctive Topics from Multiple Datasets

Probabilistic topic models have been extensively studied to discover hidden patterns in document corpora. However, rather than knowledge learned merely from a single data collection, many real-world application domains demand a comprehensive understanding of the relationships between various document collections. To address such needs, this work proposes a model that can identify the common and discriminative aspects of multiple datasets. Specifically, our approach is a Bayesian method that represents each document as a combination of common topics (shared by all document sets) and distinctive topics (distributions over words that are specific to a single dataset). Through extensive experiments, our method demonstrates its effectiveness compared to existing models, and confirms its utility as a practical tool for “comparative thinking” analysis in real-world cases.
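The common/distinctive decomposition can be illustrated with a toy generative sketch: each word in a collection is drawn from a shared vocabulary with some probability, and otherwise from that collection's own vocabulary. The word lists and the switch probability below are invented for illustration; this is not the actual CDTM parameterization, which places Dirichlet priors over full topic distributions.

```python
import random

def generate_docs(common, distinct, n_docs=4, doc_len=30, p_common=0.6, seed=7):
    """Toy sketch of the common/distinctive generative idea: each word is
    drawn from the shared word list with probability p_common, otherwise
    from the collection-specific list. Illustrative only."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n_docs):
        doc = [rng.choice(common if rng.random() < p_common else distinct)
               for _ in range(doc_len)]
        docs.append(doc)
    return docs

common_words = ["election", "vote", "poll"]       # shared across sources
news_words = ["statement", "official", "report"]  # news-specific
tweet_words = ["lol", "omg", "rt"]                # Twitter-specific
news_docs = generate_docs(common_words, news_words)
tweet_docs = generate_docs(common_words, tweet_words)
```

Inference in the proposed model runs this process in reverse: given the two corpora, it recovers which topics are shared and which are collection-specific.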

1.1.4 Seeding Simulation with Updates from Social Media Data

In the 21st century, infectious diseases such as H1N1, SARS, and Ebola are spreading much faster than at any time in history, posing an imminent threat to global public health. Efficient approaches are therefore desired to monitor and track the diffusion of these deadly epidemics.

Traditional compartmental models in epidemiology can capture characteristics of disease spreading through mathematical frameworks and contact networks, but are unable to provide timely disease surveillance based on real-world data. Techniques focusing on emerging social media platforms provide new opportunities to collect and utilize real-time disease data at the population level, but still lack an understanding of the underlying dynamics of ailment propagation. The framework proposed here achieves efficient and accurate real-time disease surveillance by combining computational epidemiology with social media mining methodologies. Specifically, the health status of individual users is first learned from their online posts via a novel Bayesian network, the disease parameters needed for the computational models are then extracted through population-level analysis, and finally the outputs of the computational epidemiology model are fed back into the mining of social media data for further performance improvement. Through extensive experiments, we demonstrate that our proposed model outperforms current approaches in disease forecasting, and is extremely effective and efficient in exploring the disease propagation process.
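The compartmental models mentioned above can be made concrete with the classic SIR model, whose parameters (transmission rate beta, recovery rate gamma) are the kind of quantities the framework estimates from social media signals. The following is a minimal forward-Euler simulation with illustrative parameter values, not the dissertation's actual simulation model.

```python
def sir_step(s, i, r, beta, gamma):
    """One Euler step of the classic SIR compartmental model:
    ds/dt = -beta*s*i, di/dt = beta*s*i - gamma*i, dr/dt = gamma*i.
    s, i, r are population fractions."""
    new_inf = beta * s * i   # newly infected this step
    new_rec = gamma * i      # newly recovered this step
    return s - new_inf, i + new_inf - new_rec, r + new_rec

def simulate(s0=0.99, i0=0.01, beta=0.3, gamma=0.1, days=160):
    """Run the SIR dynamics and return the infected-fraction curve."""
    s, i, r = s0, i0, 1.0 - s0 - i0
    curve = [i]
    for _ in range(days):
        s, i, r = sir_step(s, i, r, beta, gamma)
        curve.append(i)
    return curve

curve = simulate()
```

With beta/gamma = 3 the infected fraction rises to a single peak and then recedes; seeding such a simulation with parameters inferred from tweets is the interaction between the two spaces described above.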

1.1.5 Storyline Generation using Social Media

Storyline detection aims to connect seemingly unrelated single documents into meaningful chains, which provides opportunities to better understand how events evolve over time and what triggers such evolutions. Most previous work generated storylines through unsupervised methods, which can hardly reveal the underlying factors behind the evolution process. This paper introduces a Bayesian model to generate storylines from massive documents and infer the corresponding hidden relations and topics. In addition, our model also attempts to utilize Twitter data as human input to “supervise” the generation of storylines. Through extensive experiments, we demonstrate that our proposed model achieves significant improvement over baseline methods, and can be used to discover interesting patterns in real-world cases.

1.2 Goals and Contributions

Here we present an overview of the methods used in this dissertation. They will serve as the foundation for the novel models proposed in Chapters 2–6.

Targeted-domain Twitter Event Detection.

• Methodology for automatic label generation. Labels are generated from historical tweets, which are first ranked by various similarities to news documents, and then separated into positive and negative examples through an EM inference algorithm. This method eliminates the need for manually selected label data, and therefore reduces the cost associated with human input.

• Customized text classifier for Twitter data. To better analyze Twitter data, we utilize distinct Twitter features, such as hashtags, mentions, and replies, to cluster tweets before classification. This enables classification based on tweet groups rather than single tweets, which greatly improves classification accuracy.

• Multinomial spatial-scan location estimation. We extend spatial scan statistics with a multinomial distribution by combining factors from various location items (e.g., user-profile locations or geo-tags). This approach makes maximal use of all available Twitter geographical information.

• Extensive experimental evaluation and performance analysis. Our method was exten- sively evaluated on a real world dataset containing 305 million tweets. Compared to existing state-of-the-art methods, our method clearly demonstrated its effectiveness.

Interaction between Social Media and News.

• We propose a novel Bayesian model that jointly models the topics and interactions of multiple datasets. It is already known that knowledge learned from long articles (e.g., Wikipedia) can improve the learning of topics for short messages (e.g., tweets) [17, 85]. Our proposed model can easily transfer topical knowledge from news to tweets, improving performance on both data sources.

• We provide an efficient Gibbs sampling inference for the proposed NTIT model. Gibbs sampling was chosen for the inference and parameter estimation of the NTIT model for its high accuracy in estimation for LDA-like graphical models.

• We demonstrate the effectiveness of the proposed NTIT model compared to existing state-of-the-art algorithms. The NTIT model is tested on large-scale News-Twitter datasets associated with real-world events. With extensive quantitative and qualitative results, NTIT shows significant improvements over baseline methods.

• We explore real-world events by using our NTIT model to reveal interesting results. Our proposed model allows a variety of applications related to textual and temporal relationships. The learned estimates of hidden variables can be used to discover various items of interest, such as key documents, topic differences, and topical influences.

Learning Common and Distinctive Topics from Multiple Datasets.

• A novel Bayesian model is proposed to simultaneously identify common and distinct topics among different datasets. The proposed CDTM model is the first graphical model to focus on identifying common and dataset-specific topics among multiple datasets, and can be used to develop a wide range of applications.

• An efficient Gibbs sampling inference is provided for the CDTM model. Gibbs sampling is utilized to estimate the parameters of the CDTM model due to its high accuracy when performing estimations for LDA-like graphical models.

• The effectiveness of the proposed CDTM model is demonstrated through extensive experiments. The performance of the proposed CDTM model is compared to the most important existing state-of-the-art algorithms on real-world datasets. Based on the extensive quantitative and qualitative results obtained, the new CDTM model shows significant improvement over the baseline methods.

Seeding Simulation with Updates from Social Media Data.

• A unified framework that jointly uses social media mining and social contact network simulation is proposed. The proposed SMS model is able to collect and analyze the most up-to-date data from social media while, at the same time, remaining capable of inferring the underlying propagation process like traditional computational models.

• A bispace learning model is provided to mine disease diffusion patterns. Our SMS model consists of two spaces: a social media space and a simulation space. Different methodologies are adopted in each space for the best performance, while information is shared efficiently across the spaces through well-designed learning strategies.

• A novel learning algorithm consisting of multiple inference techniques is provided. A variety of learning approaches are incorporated in the SMS model, including Gibbs sampling, maximum likelihood estimation, and numerical optimization.

• Extensive experiments have been conducted to demonstrate the effectiveness of the proposed SMS model. The SMS model is tested on large-scale datasets against 6 existing state-of-the-art algorithms. With extensive quantitative and qualitative experimental results, the SMS model shows significant improvement over both social media mining methods and computational epidemiology models.

Storyline Generation using Social Media.

• A novel Bayesian model is proposed to capture the features of real-world events. The ASG model represents each storyline as a three-layer structure, and provides solutions to measure hidden relations among storylines, events, and topics.

• Human input is incorporated into the storyline generation process. The rich, up-to-date Twitter data provide the “cheapest” human-made labels (hashtags), since they are publicly accessible. ASG easily improves its efficiency by using these user-created Twitter hashtags to filter redundant event types.

• An efficient Gibbs sampling inference is provided for the proposed ASG model. Gibbs sampling was chosen for the inference and parameter estimation of the ASG model for its high accuracy in estimation for LDA-like graphical models.

• The effectiveness of the proposed ASG model is demonstrated through comparison with existing state-of-the-art algorithms. The ASG model is tested on large datasets associated with real-world events. With extensive quantitative and qualitative results, the ASG model shows significant improvements over baseline methods.

1.3 Organization

The remainder of this document is organized as follows. Chapter 2 provides a semi-supervised method named ATSED, which aims to efficiently detect early events from Twitter data. Chapter 3 describes a generative framework for understanding the interactions between news and social media. Chapter 4 proposes the CDTM model, a unified framework that can learn the common and discriminative topics for multiple datasets. Chapter 5 develops a hybrid model of social media mining and epidemics modeling to achieve flu prediction. Chapter 6 presents a storyline generation framework named the ASG model to organize massive documents into stories. Chapter 7 illustrates the research plan, current publications, and future work.

Chapter 2

Twitter Event Detection

2.1 STED: Semi-Supervised Targeted Event Detection

2.1.1 Introduction

Microblogs (e.g., Twitter and Weibo) have emerged as a disruptive platform for people to share their daily activities and sentiments on ongoing events. The rich, up-to-date sensing information allows important events to be discovered and tracked even earlier than in news, with important applications such as public health and emergency management. Although identifying events from newspaper reports has been well studied, analyzing messages in Twitter requires more sophisticated techniques. Twitter messages are irregular, contain misspellings and non-standard acronyms, and are written in an informal style. Additionally, tweets are filled with trivial events discussing daily life. Twitter’s noisy nature challenges traditional text-based event detection methods, so event detection approaches specifically designed for Twitter text analysis are needed.

Most previous work on Twitter event detection has focused on general, large-scale (breaking news) events, such as the Virginia Tech shooting and the Southern California wildfires. Unsupervised learning techniques, such as clustering, topic modeling, and burst detection, are mainly utilized. However, these have limited power to detect small-scale events, such as city-level or even street-level protests or strikes. Recently, new attention has been paid to event detection for a targeted topic (e.g., civil unrest, disease outbreaks, or crimes). Supervised learning techniques are primarily applied, such as support vector machines and random forest classifiers. Although this work can detect small-scale events on the targeted topic, the requirement of expensive manual data labeling limits its efficiency and scalability. Determining whether a tweet is interest-related requires far more than simple keyword filtering.
For example, if tweets related to shooting crime are required, the results returned by Twitter for the query word ’shooting’ are mixed: tweets like ’2 shot to death, 1 wounded: A shooting erupted at Mexico City airport’ are indeed related to shooting crime, but tweets like ’Shooting a music video’ in fact have nothing to do with gunfire.


In this demo, we propose a novel approach, semi-supervised targeted event detection (STED), which takes users’ specific interests as input, retrieves related tweets, and summarizes events’ spatial and temporal features into visualization results. The major contributions are as follows:

• Automatic label creation and expansion: To avoid the burdensome human effort required in previous work, we propose a method capable of generating labeled data automatically, which first transfers labels from newspapers to tweets, and further expands the initial label subspace via Twitter social ties.

• Customized text classifier for Twitter: The noisy nature of Twitter data is a new challenge for text classification. Using tweet mini-clusters obtained by graph partitioning, we build a specialized support vector machine classifier for tweet analysis.

• Enhanced location estimation technology: Utilizing tweet social ties and fast spatial scan statistics, we propagate geo-labels within location clusters for event separation.

• Visualization and analysis: Provision of event clusters, historical statistics, and related tweets via a friendly interface promotes effective and efficient human analysis.

Figure 2.1: System Framework of STED

2.1.2 Framework and Methods

As shown in Figure 2.1, the architecture of STED can be divided into the following parts. Using the Extracting and Label-Generating modules, we transfer labels from news to Twitter to generate initial label data. The Label Propagating module utilizes Twitter social features to obtain extended label data. The Graph-Partition module clusters initial single tweets into mini-tweet-clusters, and the Training module then builds a Support Vector Machine (SVM) text classifier to identify tweets related to the targeted topic. Finally, the Spatial Scan and Location Propagating modules further group target-topic-related tweets into specific events according to location.

Automatical Label Creation and Expansion

In this step, we first automatically transfer labels from news descriptions to tweets and further expand the initial label data by utilizing Twitter social ties like Retweets (RT), Hashtags (#), and Mentions (@).

Term Extracting and Label Generating: We first collect domain-specific news descriptions, such as news about crime, from public media. Though news reports are quite different from tweets in structure and expression style, the elements that specify an event remain the same: Named Entities and Action Words. Using NLTK 1, we extract Named Entities (nouns) and Action Words (verbs) from news descriptions as the candidate query word set for tweets. Given a query set as input, the label generating module scans the Twitter data and selects tweets containing at least one Named Entity and one Action Word as positive label data.

Label Propagating: Social-ties terms appear between tweets in the form of Mentions (@), Retweets (RT), and Hashtags (#). Tweets sharing common terms are more likely to discuss the same topic. We use social-ties terms to expand the initial labeled dataset L. First, we identify social-ties terms from labeled tweets, build a Term-Tweet heterogeneous network S1, and remove less popular terms. As shown in Figure 2.2a, node degrees are approximately power-law distributed, where most tweets are connected to a few high-degree terms. These high-degree terms are expected to be more related to the event, while the low-degree terms on the border are trivial. Then, we use the remaining popular terms as queries to retrieve tweets, build a Term-Tweet heterogeneous network S2, and filter away terms with low ability to denote a specific event. Figure 2.2b illustrates an example of S2, a Hashtag-Tweet heterogeneous network, where the core term of the central cluster is the hashtag ’#mexico’, surrounded by more newly found tweets (orange nodes) than initial labeled tweets (blue nodes).
Such terms should be filtered away by our system, since they are popular but shared by too many topics to represent specific interests. Finally, we build a Term-Tweet network connecting the filtered term set to the newly found label tweets, S3, as shown in Figure 2.2c. The process above is iterated until no new tweet satisfying the conditions can be found. Through the Label Propagating module, we obtain an extended label dataset for further processing.
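The Label Generating step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the entity and action-word sets, the sample tweets, and the function name are all invented for the example (the real system extracts these sets from news descriptions with NLTK).

```python
# Sketch of the Label Generating step: given entity and action-word sets
# extracted from news (via NLTK POS tagging in the real system), a tweet
# is selected as a positive label when it contains at least one named
# entity and at least one action word. All names here are illustrative.

def generate_labels(tweets, named_entities, action_words):
    """Return tweets that contain >=1 named entity and >=1 action word."""
    positives = []
    for tweet in tweets:
        tokens = set(tweet.lower().split())
        if tokens & named_entities and tokens & action_words:
            positives.append(tweet)
    return positives

entities = {"zocalo", "mexico"}       # nouns extracted from news
actions = {"protest", "march"}        # verbs extracted from news
tweets = [
    "huge protest at zocalo today",   # entity + action -> positive label
    "shooting a music video",         # neither -> rejected
    "visiting mexico next week",      # entity only -> rejected
]
print(generate_labels(tweets, entities, actions))
```

A real deployment would also produce negative labels and feed both sets to the downstream classifier; the point here is only the "at least one entity and one action word" selection rule.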

Twitter Text Classification

In this part, we first apply a graph partitioning method [118] to obtain event-related word groups and generate tweet mini-clusters, and then use a support vector machine [44] for text classification.

Graph Partitioning: Given a word w, we first build its wavelet signal, represented by the following sequence:

f_w = [f_w(T_1), f_w(T_2), ..., f_w(T_n)]    (2.1)

where f_w(T_i) is the TF-IDF score of word w during the period T_i. In this paper, to capture daily event emergence, we set the duration of T_i to one hour and the number of segments n to 24. Then, we

1 http://nltk.org/

Figure 2.2: Tweets’ Social Ties Networks. (a) Term-Tweet Heterogeneous Network of Label Tweets S1. (b) Hashtag-Tweet Heterogeneous Network of Total Tweets Space S2. (c) Term-Tweet Heterogeneous Network of New Found Tweets and Filtered Terms S3. Big nodes represent terms: red nodes are hashtags, blue nodes are mentions, and yellow nodes are retweets. Small nodes denote tweets: blue ones are labeled tweets, orange ones are newly found tweets from raw data. Edge (i, t) means tweet t contains term i.

compute the autocorrelation A_w for each word w and filter away trivial words (those appearing evenly day by day). From this, we obtain a subset Ψ of rare and noteworthy words. Next, we calculate the cross-correlation X_ij of each word pair in Ψ and construct a correlation matrix Γ containing all word pairs. This correlation matrix Γ can be viewed as a graph, and related-word clustering becomes a graph partition problem: we apply graph partitioning [71] on the correlation matrix Γ to obtain subgraphs such that words within one subgraph are highly similar (high cross-correlation), while words in different subgraphs have low cross-correlation. Finally, tweet clusters are generated from the obtained word groups: a tweet containing at least two items of word group G_i is considered an item of tweet cluster C_i.

Classifier Training: The most important part of classifier training is feature selection. Words appearing fewer than a threshold ζ times are first filtered out. Next, we calculate TF-IDF scores for words and filter out trivial words such as ’people’ and ’love’, which appear more frequently in the total Twitter space than in the labeled tweet space. In addition, to avoid overfitting, most words from the Named Entity set E should be removed, since they enjoy high TF-IDF scores and would potentially be assigned heavy weights in the SVM classifier, yet represent only one specific event.
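The correlation-based word clustering described above can be sketched as follows. This is an illustrative simplification under stated assumptions: Pearson correlation stands in for the cross-correlation X_ij, connected components of the thresholded correlation graph stand in for the spectral partitioning of [71], and the toy hourly TF-IDF signals are invented.

```python
# Sketch: cluster words whose TF-IDF time series are highly correlated.
# Pearson correlation approximates the paper's cross-correlation X_ij;
# connected components approximate the graph partitioning step.
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def word_groups(signals, threshold=0.9):
    """Group words whose signals exceed a correlation threshold."""
    words = list(signals)
    # adjacency: edge when cross-correlation exceeds the threshold
    adj = {w: set() for w in words}
    for u, v in combinations(words, 2):
        if pearson(signals[u], signals[v]) > threshold:
            adj[u].add(v)
            adj[v].add(u)
    # connected components of the correlation graph are the word groups
    groups, seen = [], set()
    for w in words:
        if w in seen:
            continue
        stack, comp = [w], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        groups.append(comp)
    return groups

signals = {
    "protest": [0, 0, 5, 9, 4, 0],   # bursts together with "marcha"
    "marcha":  [0, 1, 6, 8, 3, 0],
    "lunch":   [2, 2, 2, 2, 2, 2],   # flat, uncorrelated daily chatter
}
print(word_groups(signals))
```

In the paper's pipeline, flat signals like "lunch" would already have been removed by the autocorrelation filter; they are kept here only to show that uncorrelated words end up in their own group.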

Location Estimation

To estimate event locations, we first identify spatial clusters using fast spatial scan technology [78]. However, only 2% of tweets contain such geographic information. To make the best use of the minority of tweets with geo-labels as well as the majority without labels, we further propagate geo-labels within each cluster to amplify the spatial signal.

Spatial Scan: Geo-locations of tweets about a certain event are likely to be near the event’s occurrence location. We apply spatial scan statistics to detect significant spatial clusters, as shown in Figure 2.3. Specifically, we aggregate the count of event-related tweets at the city level and define the base of each city as the total count of original tweets. Then we apply fast subset scan [78] to identify a set of H candidate clusters with the largest Kulldorff’s statistics [51], which is defined as

K_r = (C_a − C_r) lg((C_a − C_r)/(B_a − B_r)) + C_r lg(C_r/B_r) − C_a lg(C_a/B_a)    (2.2)

where C_a and B_a refer to the total count and base in the country, respectively, and C_r and B_r refer to the count and base in the spatial region r, which is a set of neighboring cities. The empirical p-value of each candidate cluster is estimated by random permutation testing, and the clusters with empirical p-values smaller than a threshold η (e.g., η = 0.05) are returned as significant clusters. The parameter H is usually set greater than the maximum number of potential clusters that may exist, and insignificant clusters can be filtered out later by randomization testing.

Location Label Propagating: Within each cluster, we further label tweets that lack geo-information using social ties. Tweets containing common terms such as hashtags and mentions are more likely to occur in the same location. We first compute a score ω_ij = m_ij / M_i from tweets with geo-labels to denote the relativity of term i and location j, where m_ij is the number of tweets containing term i as well as location j, and M_i is the count of tweets containing term i.
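As an illustration, Kulldorff's statistic of Equation 2.2 can be computed directly for a single candidate region. The counts below are made up, and the natural logarithm is used here (the statistic is written with "lg"; the choice of log base only rescales the score, not the ranking of regions).

```python
# Kulldorff's statistic K_r from Equation 2.2, for one candidate region.
# c_a/b_a: country-wide count and base; c_r/b_r: count and base in region r.
# Natural log is used; the log base only rescales the score.
from math import log

def kulldorff(c_a, b_a, c_r, b_r):
    """Score how much region r's tweet rate exceeds the background rate."""
    return ((c_a - c_r) * log((c_a - c_r) / (b_a - b_r))
            + c_r * log(c_r / b_r)
            - c_a * log(c_a / b_a))

# A region with 80 event tweets out of 100 total, against a country-wide
# background of 200 event tweets out of 1000, scores highly ...
score = kulldorff(c_a=200, b_a=1000, c_r=80, b_r=100)
print(round(score, 2))

# ... while a region exactly at the background rate (20 of 100) scores 0.
print(round(kulldorff(c_a=200, b_a=1000, c_r=20, b_r=100), 2))
```

In the full system this score is maximized over many candidate regions via fast subset scan [78], which this single-region sketch does not attempt to reproduce.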


Figure 2.3: Example of Tweet Location Clusters. Red nodes denote locations with the highest density of tweets.

l_t = max_{j∈ϕ} { 1 − ∏_{i∈φ_t} (1 − ω_ij) }    (2.3)

Then, using Equation 2.3, we estimate the location l_t of an unlabeled tweet t, which contains a set φ_t of terms. We first compute the relativity between tweet t and each location j from the location set ϕ, and then select the location with the largest value as this tweet’s estimated location.
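The two propagation steps (the term-location score ω_ij and the maximization in Equation 2.3) can be sketched as follows; the ω values, terms, and location names are invented for illustration.

```python
# Sketch of location label propagation: omega[i][j] = m_ij / M_i is the
# relativity of term i to location j, and an unlabeled tweet takes the
# location maximizing 1 - prod_{i in tweet terms} (1 - omega[i][j]).
# All data below is illustrative.

def estimate_location(terms, omega, locations):
    """Pick the location with the highest combined term relativity."""
    def relativity(loc):
        prod = 1.0
        for term in terms:
            prod *= 1.0 - omega.get(term, {}).get(loc, 0.0)
        return 1.0 - prod
    return max(locations, key=relativity)

# omega as learned from geo-tagged tweets: e.g. 8 of 10 tweets
# containing "#mexico" carried the "Mexico City" geo-label.
omega = {
    "#mexico": {"Mexico City": 0.8, "Hermosillo": 0.1},
    "@zocalo": {"Mexico City": 0.6},
}
print(estimate_location({"#mexico", "@zocalo"}, omega,
                        ["Mexico City", "Hermosillo"]))
```

The product form means a tweet needs only one strongly location-specific term to be placed confidently, while several weak terms can still combine into a usable signal.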

2.1.3 Demonstration

We showcase the STED system using Twitter data from Latin America as an example application. The database considered is more than 400 GB in size, covering June 2012 to January 2013. Applied to civil unrest event detection, STED achieved a precision of 72% and a recall of 74%, with a lead time of 2.42 days ahead of traditional media. We implemented the STED interface in Python and provide users with the following features:

• Map visualization of targeted-interest event clusters at the city level.

• Detailed information about each event cluster, including related tweets and word cloud summaries.

• Graphical analyses of historical statistics, including spatial comparison among regions and temporal trends within one region.

Figure 2.4: Interface of STED system

Figure 2.5: Historical Analysis Screenshot 16

Figure 2.4 shows a screenshot of the STED interface. With the STED system, a user can search for events related to their specific interests and analyze these events’ spatial and temporal features. A targeted interest includes time, location, topic, and keywords. Users can choose a date and topic, as well as type in keywords, in the right part of the interface. As an ongoing project, we have applied our method to detect interests of crime and unrest. As shown in the screenshot, the user’s interest is detecting events of type ’Civil Unrest’ in the country ’Mexico’, with keyword ’protest’, from ’2012-07-06’ to ’2012-07-07’. After the user clicks the ’Search’ button, STED returns the corresponding event clusters for the targeted interest, shown as balloons. By clicking on one of the balloons, users can find detailed information on the corresponding event in the left part of the interface: tweets ranked by their relativity to the user’s interests, and a word cloud event summary denoting terms’ relative importance. The system feedback for the given interest shown in the screenshot reveals that there was a march (Spanish word ’marcha’ in the word cloud) held by YoSoy132 to protest presidential election results, which was also reported by public media.

It is also possible to study targeted events spatially and temporally using the historical statistics analysis interface. Given a city and a historical period range, users can track the interest-related event trend of that city. At the bottom of Figure 2.5, we show the interest-related event summary for the city ’Mexico City’, from July 2012 to December 2012. Users can also compare spatial features of events of interest. In the upper part of Figure 2.5, we list the top-10 cities in the given country ’Mexico’ ranked by historical event count.

2.1.4 Conclusion

We present STED, an interactive system to help users find targeted events from noisy and complex Twitter data. By automatically generating pseudo-label data, our system significantly reduces the human workload required in previous work for targeted event detection. A customized text classifier based on mini-clustering and enhanced location estimation are further designed for event characterization. Using data collected in Latin American countries, we demonstrate the effectiveness of STED in detecting targeted events such as civil unrest and crimes. The visualization and interaction features of our system also help users efficiently track and interpret detected events of interest.

2.2 Automatic Targeted-Domain Spatiotemporal Event Detec- tion in Twitter

2.2.1 Introduction

Online social microblogs such as Twitter have become a major medium for information sharing. The rich, up-to-date sensing data in Twitter allows important events to be discovered and tracked prior to their inclusion in standard news bulletins. When a social event occurs, traditional media usually take hours or even days to report the related news, while the corresponding information may begin to spread immediately after the occurrence in social media like Twitter [110, 119]. For example, Figure 2.6 depicts the number of tweets and news reports related to a spatiotemporal event (a protest held by local residents) that happened around 12 noon on January 12th, 2013 in Mexico. The number of event-related tweets increased immediately after the event began (12 noon), while the first news report was published at 2 pm, 2 hours after the tweet burst.


Figure 2.6: Number of tweets and news reports related to a protest event occurring at around 12 noon on January 12, 2013 in Mexico.

Although detecting events from formal texts has been extensively studied [12, 24], analyzing messages from Twitter requires more sophisticated techniques. First, newswire texts are relatively long and well written, while Twitter messages are short and written in a much more informal style. It is therefore unrealistic to simply apply traditional news-text-based event detection methods to Twitter data. Moreover, events mentioned in news documents have already been identified as being of general importance, but in the case of Twitter data, nearly half of all tweets are actually non-event-related babble discussing the minutiae of daily life.

In previous studies of Twitter event detection, most researchers have adopted general-domain event detection approaches to extract popular open-domain events, without imposing specific constraints on event type. These methods generally utilize unsupervised learning techniques, such as clustering [62, 118], topic modeling [123], and burst detection [57], all of which can catch breaking news yet will not normally identify relatively small-scale spatiotemporal events. However, different users may demand different information from Twitter. For instance, companies need feedback about their products from customers, governments seek data related to social events (such as crime [64], civil unrest, and disease outbreaks [102, 7]), and scientists are interested in collecting tweets about natural disasters [98] or climate change. We call these demands related to tracking information in a specific domain targeted-domain event detection. Existing targeted-domain event detection methods have applied supervised learning techniques (e.g., SVM) to differentiate event-related tweets from non-event-relevant contexts [64, 98]. However, these methods suffer from the following shortcomings: 1) Heavy reliance on manually labeled data.
To build a training dataset for supervised learning, these technologies require extensive human input to label tweet data correctly, and to maintain good system performance, these label datasets must be updated regularly. Each day, more than 200 million active Twitter users publish over 400 million tweets 1. This huge volume of data makes periodic updates extremely expensive and even unrealistic. 2) Inability to utilize Twitter’s distinct features. Classifiers designed by existing methods usually treat Twitter data as a set of plain textual documents, without any consideration of Twitter network properties such as "mentions", "hashtags", and "replies". In Twitter data, "hashtags" can be used to denote tweets about the same topics, one user can "mention" another user, and a tweet can be "replied to" by another tweet. 3) Restricted ability to estimate event location. Existing methods usually predict event location through single location terms that involve either user locations [64] or GPS tags [98], discarding all other types of geopolitical terms. Instead, our proposed multinomial spatial scan considers all possible Twitter location terms, including registered locations in user profiles, GPS information, and geo-tags mentioned in the tweet content.

Date       Location            Top 5 Event Phrases
1/10/2012  Mexico/Mexico City  peasant, camp, Tabasco, protest, 900
1/10/2012  Brazil/São José     GM, protest, layoff, Jaime, close
1/12/2012  Mexico/Mexico City  dogs, protest, #yosoycan26, march, Zocalo
1/13/2012  Mexico/Hermosillo   Padrés, tax, protest, Sonora, congress

Figure 2.7: ATSED output example.

In this paper, we propose a semi-supervised approach for detecting spatiotemporal events from Twitter, named Automatic Targeted-domain Spatiotemporal Event Detection (ATSED). Figure 2.7 is an illustrative example of our model’s output. Given historical news reports related to a specific domain, such as "civil unrest", ATSED can yield a set of real-time "civil unrest" events detected from Twitter, which consist of key information such as location, date, and a brief description. First, utilizing the knowledge learned from news reports, ATSED automatically generates labels from historical Twitter data. These Twitter labels then serve as training data for a classifier specially designed for Twitter data analysis. Next, the trained classifier is applied to real-time Twitter data streams to identify event-related tweets. Finally, event locations are extracted from event-related tweets through a novel multinomial spatial-scan method. In summary, this article makes the following contributions:

1 https://blog.twitter.com/2013/celebrating-twitter7

• Methodology for automatic label generation. Labels are generated from historical tweets, which are first ranked by various similarities to news documents, and then separated into positive and negative examples through an EM inference algorithm. This method eliminates the need for manually selected label data, and therefore reduces the cost associated with human input.

• Customized text classifier for Twitter data. To better analyze Twitter data, we utilize distinct Twitter features, such as hashtags, mentions, and replies, to cluster tweets before classification. This enables classification based on tweet groups rather than single tweets, which greatly improves classification accuracy.

• Multinomial spatial-scan location estimation. We extend spatial scan statistics with a multinomial distribution by combining factors from various location items (e.g., user-profile locations or geo-tags). This approach makes maximal use of all available Twitter geographical information.

• Extensive experimental evaluation and performance analysis. Our method was exten- sively evaluated on a real world dataset containing 305 million tweets. Compared to existing state-of-the-art methods, our method clearly demonstrated its effectiveness.

2.2.2 Related Work

This section reviews research directions related to our work. The first branch consists of detection methods that have been widely used in tracking events from news streams. Recently, event detection on social media streams has become a hot research topic; existing event detection algorithms can be broadly classified into two categories: general-domain and targeted-domain approaches. In addition, with respect to automatic label generation, our approach is related to distant supervision and transfer learning.

Event detection in newswire documents

Much research has focused on detecting events from formal texts, such as news articles, blogs, and emails. Some of these approaches group documents into events based on their semantic similarity. Brants et al. [12] built an event detection system based on an incremental TF-IDF model, identifying events by calculating the Hellinger distance between new texts and previous documents. Kumaran et al. [52] took a different approach, detecting new events by extending cosine similarity and the vector space model to include story categorization and the use of named entities. Other researchers have sought to first identify event-related features and then cluster feature bursts into events. For example, Fung et al. [24] proposed a way to identify events that consist of a set of bursty features appearing simultaneously. Their model treats bursty features as a time series of probabilities: bursty features are first evaluated by their distributions, and strongly interrelated bursty features are then grouped to create bursty events.

While news event detection methods work well for formally written news articles, they are incapable of detecting events from social media data like tweets. Tweets are very short and often written informally, with abbreviations and mistakes. More sophisticated technologies that can handle noisy Twitter data are therefore desired.
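For concreteness, the Hellinger distance used as the similarity measure in the incremental TF-IDF detector of Brants et al. [12] can be computed as follows; the word distributions here are toy examples, and this sketch is not the original system's code.

```python
# Hellinger distance between two word probability distributions over the
# same vocabulary: 0 for identical distributions, 1 for disjoint support.
from math import sqrt

def hellinger(p, q):
    """Distance in [0, 1] between two discrete distributions."""
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))

same = hellinger([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
diff = hellinger([0.9, 0.1, 0.0], [0.0, 0.1, 0.9])
print(same, round(diff, 3))
```

In such a detector, a new document whose term distribution is far (in Hellinger distance) from all previously seen documents is flagged as a new event.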

General-domain event detection in Twitter

To detect emerging general events in Twitter streams, general-domain event detection usually applies unsupervised learning techniques, such as topic modeling, burst detection, and clustering. Topic modeling is a particularly popular solution, since event detection in Twitter data is similar to the problem of topic detection in formal texts. For example, Yin et al. [123] developed topic modeling techniques to detect geographic topic clusters in local regions. Cataldi et al. [16] proposed the use of a graph for topic detection, using emerging terms in tweets posted by authoritative users. Ritter et al. [96] focus on extracting events from noisy Twitter data and then generating event categories based on latent variable models. Another alternative is to detect events through spatiotemporal word bursts. Lappas et al. [57] examined ways to discover terms that burst in geographical neighborhoods within a certain time period, taking into account content, structural, and temporal signals. Clustering technologies have also been utilized for event detection. For example, a recent study applied wavelet analysis for noise filtering in Twitter and identified word groups with high correlations as indicators of an event [118]. Petrovic et al. [84] detected breaking news from Twitter data by building a nearest-neighbor tweet network and summarizing connected tweets into events.

Our goal differs from that of general-domain event detection, as we seek to detect events in a particular domain, such as earthquakes, disease outbreaks, social unrest, or crimes. From the perspective of detection, our work is most closely related to targeted-domain event detection and its approaches.

Targeted-domain event detection in Twitter

Supervised learning methods are commonly used in targeted-domain event detection. Typically, a classifier is trained via manually labeled data to identify tweets in the targeted domains, and then clustering techniques are applied to analyze the events' locations. Sakaki et al. [98] first trained an SVM classifier to recognize tweets about "earthquake", and then built a Kalman filtering model to detect the geographic regions of these events. Similarly, Li et al. [64] focused on crime event detection from Twitter, training a classifier with crime-domain keywords and Twitter-specific features (e.g., hashtags). Popescu et al. [87] utilized targeted named entities and a decision-tree strategy to decide whether corresponding snapshots do indeed represent an event. Becker et al. [5] began by clustering similar tweets, and then applied a manually trained classifier to identify different events, based on features such as hashtags and retweets. Zhang et al. [124] utilized labeled documents from a source domain to help build a latent semantic space for short texts in the target domain. Unlike the method presented here, their methods all require extensive labeled data in the source domain. Due to their supervised nature, existing methods aimed at detecting targeted events usually require expensive human effort to create suitable labeled data. Our previous work proposed a method capable of identifying trustworthy tweets [125]. In this work, we attempt to build an appropriate label dataset automatically, and utilize these automatically generated data for detecting spatiotemporal events.

In summary, traditional event detection methods are suitable for news documents, but work poorly on noisy Twitter data. Most previous work on social media event detection takes a general-domain approach: such methods identify breaking news, i.e., the most popular events during a certain period of time, regardless of the specific event type. Only a few detection methods are able to recognize events of targeted event types; these are most closely related to our proposed ATSED. Unlike ATSED, however, none of these targeted-domain detection systems is capable of automatically detecting social media events without pre-given human-labeled data. Moreover, few of these previous works focused on spatiotemporal event detection, and they made poor use of tweets' location information.

Distant supervision and transfer learning

Transfer learning techniques usually first extract knowledge from the source domain and then utilize that knowledge for tasks in the targeted domain [81]. Several approaches have adopted transfer learning technologies for Twitter text mining. Jin et al. [43] developed a variation of LDA to jointly learn topics from both short and long texts; the knowledge shared by the two datasets is controlled by different settings of Dirichlet priors. Zhang et al. [124] first learned a latent semantic space from a source dataset, and then mapped the target dataset into that space for further mining tasks. Phan et al. [86] enriched Twitter with hidden topics learned from external data sources such as Wikipedia and MEDLINE. Their model is designed to find long texts related to given short texts; conversely, our work aims to extract short tweet labels from given long articles. Distant supervision methods heuristically label a corpus using supervision from a known knowledge base [99]. Mintz et al. [75] use existing relations in an external knowledge base as training data: for each entity pair, they collected all the sentences mentioning the pair in text, used the pair's relation type in the knowledge base as the label, and trained a classifier on these generated labels to learn relations. Purver et al. [89] used a heuristic intuition (emotional markers) to generate noisy labels first, and then examined classifiers trained on these pseudo labels. The intuition shared by distant supervision, transfer learning, and our proposed method is that some hidden patterns and relationships are common to the source and target datasets, so knowledge learned from the source is likely to appear in the target data in some form. Our method can be viewed as distant supervision, as we first generate pseudo labels with heuristic rules, and we demonstrate good performance despite the imperfect labels.
Most distant supervision methods are proposed to learn the relationships between entities or words [95, 75], under the supervision of a large knowledge base. In contrast, our goal here is to study the relationship between events and words, and the supervisor is an external document dataset (similar to transfer learning to some extent).

2.2.3 Framework and Problem Formulation

This section first introduces the framework of ATSED, then formally describes some key concepts used in this paper, and finally defines the tasks of this paper based on these concepts.

Framework

Our framework consists of two main components: label generation and spatiotemporal event detection. The input data sources are historical Twitter data, historical news articles, and real-time Twitter streams. The historical Twitter data and news articles are used by the label generation component to produce pseudo labels. The spatiotemporal event detection module then trains a classifier on these labels and detects events from the real-time Twitter data. In other words, tweet labels are generated in the label generation component utilizing historical news articles; based on the labels generated from the historical data, the spatiotemporal event detection module identifies on-going events of targeted interest from real-time Twitter streams.

[Figure: pipeline from news reports and historical Twitter data through Feature Extraction, Relevancy Ranking, and Label Refinement (Automatic Label Generation), then Tweet Classifier and Location Estimation (Spatiotemporal Event Detection) over real-time Twitter data, producing events.]

Figure 2.8: ATSED system architecture.

The label generation module produces both positive and negative tweet examples using knowledge learned from the given news report documents. First, the feature extraction submodule detects domain features (domain words) and event features (event words) from news reports. Next, the domain words and event words are utilized as queries to search the Twitter data. Then, a relevancy ranking method is proposed to evaluate tweets' relevancy to the given event, based on the spatial, temporal, and textual similarities between tweets and event-related news documents. Tweets with high relevancy scores are considered candidates for positive examples, while tweets with low scores are potential negative examples. Finally, an expectation maximization (EM) label refinement algorithm is provided to further separate the positive and negative examples.

The tweet classifier submodule combines clustering and classification. Tweets in the real-time Twitter data stream are first clustered into mini-tweet-groups, utilizing tweets' social ties such as hashtags, mentions, and replies. Next, the clustered tweet groups are fed into the classifier (trained on the historical labels from the label generation module), which assigns tweets to the positive and negative classes. In the location estimation submodule, an extended spatial scan approach is harnessed to cluster the tweets in the positive class into different spatiotemporal events. As a result, each event detected by ATSED is represented by a location, a timestamp, and event-related tweets.

Problem Formulation

Corresponding to the framework introduced above, targeted-domain Twitter event detection can be formally defined in terms of two tasks, label generation and spatiotemporal event detection, beginning with a few key concepts as follows. First, different from trivial daily-life events, the events discussed in this paper are "significant": because they are significant, these events are discussed in public media and associated with news articles.

Definition 2.2.1 (Spatiotemporal Event) A spatiotemporal event x = (l, t) is a significant real-world incident that happened at location l and time t. Domain X_p is defined as a set of events falling into the same domain p, such as music, sports, civil unrest, etc.

Definition 2.2.2 (Article) The article set of targeted domain p is designated Ap, while the set of open-domain articles (containing various topics) is designated A. An article ax ∈ Ap denotes a news report document about event x. Notice that one event may be associated with multiple news reports, so we merge these documents into one article.

Suppose we are interested in detecting events in the targeted domain "civil unrest". For example, the event "dog protest" happened on January 12, 2013 in Mexico 2. A segment of the event-related news article is as follows (the original Spanish text has been translated into English using Google Translate):

Accompanied by a dozen dogs, about 150 people of the movement YoSoyCan26 marched around the Zocalo of Mexico City, and insisted that the 57 dogs captured in connection with the homicides in Cerro de la Estrella be freed.

2 http://www.milenio.com/cdb/doc/noticias2011/fcd1c695e4a21d7edcae432c9f931ecd?quicktabs1=2

Besides news articles, when an event occurs, there may also exist tweets relevant to the given event. Among these event-related tweets, some are truly relevant to the given event. For example, the tweet "With protests in the Zocalo, #YoSoyCan26 requires Iztapalapa dogs to be free."3 is a positive tweet for the event "dog protest". In contrast, negative examples are tweets that share some features with positive ones yet are in fact irrelevant to the given event. For example, the tweet "I do not understand social networks. Fuss over a dog, I have not seen it to help people in the street."4 has some positive features (e.g., "dog" and "street"), but fails to provide any information related to the given protest event.

Definition 2.2.3 (Tweet) A tweet y = (d,l,t) contains textual document d, location l and time- stamp t. Twitter data stream Y is therefore defined as a set of tweets.

Definition 2.2.4 (Positive Tweet) A tweet y(x) = (d,l,t) containing textual document d, location l and time-stamp t is a positive tweet to event x, if it is truly related to event x.

Definition 2.2.5 (Negative Tweet) A tweet ȳ^(x) = (d, l, t) containing textual document d, location l and time-stamp t is a negative tweet to event x, if it is in fact irrelevant to event x.

With the concepts of "event", "article", and "tweet", we can further define the concept of "label" used in this paper, which consists of an event, its news article, and its tweets.

Definition 2.2.6 (Label) A label z is defined as (x, Y^(x), Ȳ^(x)), where x is an event, Y^(x) is the set of tweets related to event x, and Ȳ^(x) is the set of irrelevant tweets. The label set Z_p = {(x, Y^(x), Ȳ^(x)) | x ∈ X_p} for target domain p consists of labels generated from the events X_p in domain p.

Given a list of historical events and corresponding newswire documents, the task of label genera- tion is to determine the set of tweets related to each event.

Task 1 (Label Generation) Given an event set X_p and a news article set A_p, where each event x_i ∈ X_p has a corresponding news article a_{x_i} ∈ A_p, the goal of label generation is to find the label set Z_p = {(x, Y^(x), Ȳ^(x)) | x ∈ X_p} from the historical tweets Y. Note that both the tweets and news articles used in the label generation module consist of historical data. In contrast, spatiotemporal event detection discovers newly emerging events in the targeted domain; therefore, the Twitter data used in spatiotemporal event detection consist of real-time data streams.

Task 2 (Spatiotemporal Event Detection) Given a label set Z_p (the product of Task 1) and a real-time Twitter stream Y′, the event detection algorithm aims to identify an on-going event set X′_p for targeted domain p from the Twitter data stream Y′. Each spatiotemporal event x′ ∈ X′_p consists of a location l′, a time t′, and event-related tweets I_p^(x′).

3https://twitter.com/BicitlanRadio/status/290232591246823425 4https://twitter.com/revistaeneo/status/290185989815676930 25

2.2.4 Automatic Label Generation

In this section, we discuss the Automatic Label Generation (ALG) algorithm in detail. First, ALG extracts feature terms from news reports, then ranks tweets based on their similarities to the news reports, and finally splits the tweet set into positive and negative examples through an EM-based refinement algorithm.

Feature Extraction

The goal of feature extraction is to obtain features that can identify a specific event in the targeted domain. Although tweets and news articles are quite different in writing style, they are likely to share some semantic features when describing the same event, which are referred to as domain words and event words in this paper. Domain words are those most representative words for events occurring in a certain domain. For example, the words "protest" and "march" may be domain words for "civil unrest" events. Event words are words that can distinguish a particular event from other events in the same domain. In the above mentioned news article ("dog protest" event), the words "YoSoyCan26" and "Zocalo" are event words which are highly relevant to the specific event. To identify "domain words" and "event words", we define domain weight and event weight as follows.

Definition 2.2.7 (Domain Weight) Domain weight C(w_i, p) quantifies the ability of word w_i to represent targeted domain p. Given the targeted-domain news article set A_p = ∪_{i=1}^{n} a_{x_i} and an open-domain document set A, C(w_i, p) is computed as the product of two parts, namely the normalized term frequency f(w_i, A_p) of word w_i in targeted-domain set A_p, and the inverse document frequency of w_i in open-domain set A:

C(w_i, p) = \frac{f(w_i, A_p)}{\max\{f(w, A_p) : w \in A_p\}} \times \lg \frac{|A|}{|\{a \in A : w_i \in a\}| + 1}. \quad (2.4)

Definition 2.2.8 (Event Weight) Event weight E(w_i, x) quantifies the ability of word w_i to distinguish event x from other events in the same domain. It is computed as the product of two parts: the normalized term frequency of word w_i in the event article a_x, and the inverse document frequency of w_i in the document set A_p:

E(w_i, x) = \frac{f(w_i, a_x)}{\max\{f(w, a_x) : w \in a_x\}} \times \lg \frac{|A_p|}{|\{a \in A_p : w_i \in a\}| + 1}. \quad (2.5)
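Both weights are TF-IDF-style scores that differ only in which corpus plays the target and background roles: A_p vs. A for the domain weight, and a_x vs. A_p for the event weight. A small sketch under that reading (function and variable names are mine, not from the paper):

```python
import math
from collections import Counter

def tfidf_weight(word, target_docs, background_docs):
    """Normalized TF in the target set times IDF in the background set,
    mirroring the shape of Equations (2.4)/(2.5). Each doc is a word list."""
    tf = Counter(w for doc in target_docs for w in doc)
    if tf[word] == 0:
        return 0.0
    norm_tf = tf[word] / max(tf.values())
    df = sum(1 for doc in background_docs if word in doc)
    idf = math.log10(len(background_docs) / (df + 1))
    return norm_tf * idf

# Domain weight C(w, p): target = domain article set A_p, background = open-domain set A.
# Event weight  E(w, x): target = the single event article a_x, background = A_p.
```

A domain-specific word such as "protest" should score above a ubiquitous word such as "city", which appears in documents of every domain and so gets a near-zero IDF.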

At the beginning, we compute the domain weight and event weight for all words in A_p; that is, both the domain word set and the event word set are initialized to the vocabulary of A_p. The MAD algorithm [114] is adopted to decide thresholds that remove trivial words. After applying this hard threshold filtering, only words with values (domain weight or event weight) greater than the thresholds are kept in the corresponding set. Taking domain words as an example, the domain weight threshold η_c can be calculated as follows.

\delta_c = \mathrm{median}(\{f(w, A_p) : \forall w \in A_p\}), \quad (2.6)

\eta_c = \delta_c + \alpha_c \times \mathrm{median}(\{|f(w, A_p) - \delta_c| : \forall w \in A_p\}). \quad (2.7)

As shown in Equation (2.7), parameter α_c determines the value of threshold η_c. When α_c is set too small (e.g., 0.1), trivial words such as "yesterday", "adult", and "down" are selected as domain words. Conversely, a large value of α_c will remove important words. As suggested by Leys et al. [61], the value of α_c can be set to 1/Q(0.75), where Q(0.75) is the 0.75 quantile of the distribution. We therefore set α_c to 3.97 (η_c = 0.087), which returns a medium-size domain word set containing 52 words. Similarly, a threshold η_e computed by the MAD algorithm removes trivial words from the event word set. The domain words and event words extracted from the news reports can now be used as queries to search the Twitter data. Only tweets containing at least one domain word or one event word are retrieved and sent to the next module, relevancy ranking.
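The median/MAD cutoff of Equations (2.6)-(2.7) can be computed directly; a minimal sketch assuming scalar weight values (the `alpha` argument corresponds to α_c):

```python
import statistics

def mad_threshold(values, alpha):
    """Median + alpha * MAD cutoff used to filter trivial words (Eqs. 2.6-2.7)."""
    med = statistics.median(values)                      # delta_c in Eq. (2.6)
    mad = statistics.median(abs(v - med) for v in values)
    return med + alpha * mad                             # eta_c in Eq. (2.7)
```

Words whose weight falls below the returned threshold would be dropped from the corresponding word set.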

Relevancy Ranking

The relevancy ranking module evaluates the relevancy between tweets and events. To compute this, "total" relevancy is factorized as a product of three similarity subfactors: textual, spatial, and temporal similarity.

Textual Similarity

As shown in Equation (2.8), the textual similarity φx,y between event x and tweet y is defined as the product of tweet words’ domain weight sum and event weight sum:

\phi_{x,y} = \sum_{w_i \in (d_y \cap W_C^{(p)})} C(w_i, p) \times \sum_{w_i \in (d_y \cap W_E^{(x)})} E(w_i, x), \quad (2.8)

where d_y is the textual content of tweet y. Only words in the domain word set W_C^{(p)} are considered when calculating the domain weight sum, and only words in the event word set W_E^{(x)} of event x are considered when computing the event weight sum. The rationale behind the formula is as follows.

• Sum of domain/event word weights. A tweet is more likely to be event-related if it contains more domain words and event words. To accumulate the effects of individual words, both the first and the second term in Equation (2.8) take the form of a word weight sum.

• Product of weight sums. Only tweets containing both domain words and event words in sufficient measure qualify as event-related. A tweet with many domain words but few event words may discuss other events in the same domain, while a tweet with many event words (e.g., an event location name) but few domain words may relate to events in other domains (e.g., something else that happened in the same location). To balance the effects of domain words and event words, Equation (2.8) multiplies the domain weight sum by the event weight sum.

Spatial Similarity

The spatial similarity between event x and tweet y is decided by two factors: 1) the distance between tweet location l_y and event occurrence location l_x, and 2) the spatial influence scope of tweet y. The first factor relates events and tweets in the same location: an event and a tweet are more likely to be relevant if they are close in distance. The second factor further enhances the event relevancy of tweets with high textual-similarity scores: intuitively, at the same distance from the event occurrence location, tweets with higher textual-similarity scores are more likely to be event-related. Therefore, tweet y's spatial influence for event x is modeled as a Gaussian distribution \varphi_{x,y} = N(l_y, \Sigma_{x,y}), centered at tweet y's location l_y, with influence scope

\Sigma_{x,y} = \begin{pmatrix} \phi_{x,y} & 0 \\ 0 & \phi_{x,y} \end{pmatrix},

where \phi_{x,y} is the textual similarity defined in Equation (2.8).

Temporal Similarity

After the initial burst of tweets upon the occurrence of a particular event, the number of event-related tweets usually decreases as a Poisson process [98]. In other words, the probability of tweet y being related to event x decreases as time goes by, meaning the likelihood that an individual tweet relates to the event also decays over time. Therefore, the temporal similarity between tweet y and event x can be described by an exponential distribution:

\rho_{x,y} = \lambda e^{-\lambda |t_x - t_y|}, \quad (2.9)

where t_x is the occurrence time of event x and t_y is the publishing time of tweet y.

By integrating the textual, spatial, and temporal similarities, the event-tweet relevancy ψx,y is ranked by the following function:

\psi_{x,y} = \phi_{x,y} \cdot \varphi_{x,y} \cdot \rho_{x,y}. \quad (2.10)

For a tweet y, we choose the event x* that maximizes the event-tweet relevancy \psi_{x,y} as its most correlated event:

x^* = g(y) = \arg\max_{x \in X_p} \psi_{x,y}. \quad (2.11)

Correspondingly, for each event x, its related tweet set Ỹ^(x) is identified through the inverse process: each tweet y^(x) in set Ỹ^(x) satisfies g(y^(x)) = x.
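Combining Equations (2.8)-(2.11): the textual score φ feeds the spatial Gaussian's variance, the three similarities are multiplied, and each tweet is assigned to its argmax event. A simplified sketch with planar coordinates and unit-free timestamps (all names are illustrative, and φ is assumed precomputed by Eq. (2.8)):

```python
import math

def spatial_similarity(event_loc, tweet_loc, phi):
    """2-D isotropic Gaussian centered at the tweet location with
    variance phi (the textual similarity), evaluated at the event location."""
    if phi <= 0:
        return 0.0
    dx, dy = event_loc[0] - tweet_loc[0], event_loc[1] - tweet_loc[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * phi)) / (2 * math.pi * phi)

def temporal_similarity(event_time, tweet_time, lam=0.5):
    """Exponential decay of Eq. (2.9)."""
    return lam * math.exp(-lam * abs(event_time - tweet_time))

def relevancy(phi, event_loc, event_time, tweet_loc, tweet_time, lam=0.5):
    """psi = phi * varphi * rho, as in Eq. (2.10)."""
    return (phi * spatial_similarity(event_loc, tweet_loc, phi)
                * temporal_similarity(event_time, tweet_time, lam))

def best_event(tweet_loc, tweet_time, events, phis):
    """Eq. (2.11): argmax over candidate events; `events` maps an event id
    to (location, time), `phis` holds the per-event textual similarity."""
    return max(events, key=lambda x: relevancy(phis[x], *events[x],
                                               tweet_loc, tweet_time))
```

A tweet posted near an event, shortly after it, should be assigned to that event rather than to a distant, older one.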

Label Refinement

The initial event-tweet pairs obtained using the procedure outlined above contain a great deal of noisy data. Although the top-ranked tweets are indeed highly related to the corresponding events (positive examples), many of the low-ranked tweets are in fact irrelevant (negative examples). However, it is difficult to set a uniform threshold suitable for all events to separate the positive and negative tweets. One alternative is to cluster tweets based on their similarities, assuming that positive examples are more similar to each other than to negative ones. Suppose we had a set of positive tweets and a set of negative examples; based on these existing label sets, the labels of other tweets could be inferred from their similarities. However, the assumed positive and negative sets are actually unknown. This turns out to be an inference-dependency problem: the inference of a single tweet's label depends on the existing positive and negative sets, while constructing the positive and negative sets depends on the assignment of each tweet. Therefore, an EM-based inference algorithm is developed and applied here to solve this "inference dependency" problem.

For an event-tweets pair (x, Ỹ^(x)), each tweet y_j^(x) ∈ Ỹ^(x) is represented by an n-dimensional feature vector v_j^(x), where n is the total number of words in the event-related tweet set Ỹ^(x). An element v_{jw}^(x) ∈ v_j^(x) is set to h if word w ∈ Ỹ^(x) appears h times in tweet y_j^(x). The tweets' relevancy distribution is modeled as a mixture of Q Gaussians, in which the q-th Gaussian is denoted G_q = N(\mu_q, \Sigma_q) with mixing coefficient \theta_q. The goal is to maximize the likelihood function:

p(\tilde{Y}^{(x)}) = \prod_{j=1}^{n} p(v_j^{(x)}) = \prod_{j=1}^{n} \sum_{q=1}^{Q} \theta_q \cdot N(v_j^{(x)} | \mu_q, \Sigma_q). \quad (2.12)

E-step. In the E-step, given the estimates of parameters \mu and \Sigma, the probability of v_j^{(x)} belonging to Gaussian G_q is calculated as follows:

p(G_q | v_j^{(x)}) = \frac{\theta_q N(v_j^{(x)} | \mu_q, \Sigma_q)}{\sum_{m=1}^{Q} \theta_m \cdot N(v_j^{(x)} | \mu_m, \Sigma_m)}. \quad (2.13)

M-step.
In the M-step, by taking partial derivatives of Equation (2.12), the parameter estimates are updated as follows:

\mu_q^* = \frac{\sum_{j=1}^{n} p(G_q | v_j^{(x)}) v_j^{(x)}}{\sum_{j=1}^{n} p(G_q | v_j^{(x)})}, \quad (2.14)

\Sigma_q^* = \frac{\sum_{j=1}^{n} p(G_q | v_j^{(x)}) (v_j^{(x)} - \mu_q^*)(v_j^{(x)} - \mu_q^*)^T}{\sum_{j=1}^{n} p(G_q | v_j^{(x)})}, \quad (2.15)

\theta_q^* = \frac{\sum_{j=1}^{n} p(G_q | v_j^{(x)})}{n}. \quad (2.16)

First, the tweet set Ỹ^(x) is sorted in descending order of relevancy score and split into Q parts to initialize the Gaussian mixtures. Then the E-step and M-step are conducted iteratively. When convergence is achieved, the Gaussian component with the maximum relevancy score is selected as the positive example set Y^(x), while tweets in the other Gaussians are treated as negative examples Ȳ^(x). This accomplishes Task 1, label generation: for each event x_i ∈ X_p, the label z_i = (x_i, Y^(x_i), Ȳ^(x_i)) can be generated from the historical Twitter stream Y through the above process.
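The refinement loop can be illustrated with a deliberately simplified one-dimensional EM sketch: instead of the paper's word-count vectors, each tweet is reduced to its scalar relevancy score, and q Gaussians are fit to those scores. The function names and the scalar simplification are mine, not the paper's:

```python
import math
import statistics

def gauss(x, mu, var):
    """Univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def refine_labels(scores, q=2, iters=100):
    """1-D EM sketch of Eqs. (2.12)-(2.16): fit q Gaussians to relevancy
    scores; tweets assigned to the highest-mean component are positives."""
    # Initialize by splitting the descending-sorted scores into q chunks.
    ordered = sorted(scores, reverse=True)
    n, chunk = len(ordered), max(1, len(ordered) // q)
    parts = [ordered[i * chunk:(i + 1) * chunk] or [ordered[-1]] for i in range(q)]
    mu = [statistics.mean(p) for p in parts]
    var = [max(statistics.pvariance(p), 1e-6) for p in parts]
    theta = [len(p) / n for p in parts]
    for _ in range(iters):
        # E-step: responsibilities p(G_q | v), Eq. (2.13).
        resp = []
        for s in scores:
            w = [theta[k] * gauss(s, mu[k], var[k]) for k in range(q)]
            tot = sum(w) or 1.0
            resp.append([wk / tot for wk in w])
        # M-step: update mu, Sigma, theta, Eqs. (2.14)-(2.16).
        for k in range(q):
            nk = sum(r[k] for r in resp) or 1e-12
            mu[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var[k] = max(sum(r[k] * (s - mu[k]) ** 2
                             for r, s in zip(resp, scores)) / nk, 1e-6)
            theta[k] = nk / n
    pos = max(range(q), key=lambda k: mu[k])
    return [max(range(q), key=lambda k: r[k]) == pos for r in resp]
```

With well-separated score groups the mixture recovers the intended positive/negative split.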

2.2.5 Spatiotemporal Event Detection

In this section, we discuss the detection of newly emerging spatiotemporal events from real-time Twitter data streams. A tweet classifier is first trained using the historical event-tweet labels generated as described in the previous section. Then, tweets in the positive class (the output of the classifier) are grouped into geo-clusters (events) by applying a multinomial spatial scan method.

Tweet Classifier

Different from traditional text classifiers, our proposed tweet classifier consists of two parts: social-ties clustering and mini-tweet-group classification. We first cluster tweets into mini-groups based on their social ties, and then conduct SVM-based classification on these mini-tweet-groups.

Social Ties Clustering This clustering process is applied to both the event-tweet labels (training data) and the incoming Twitter data stream (testing data). The basic idea here is that tweets sharing common social ties (e.g., mentions, replies, and hashtags) are more likely to be about the same topic. To cluster tweets through social ties, a tweet-tie heterogeneous graph is built and then split into small subgraphs by applying graph partitioning. As shown in Figure 2.9, tweets are connected by social ties to create a tweet-tie heterogeneous graph Λ = (Y, E), where Y is the tweet set, denoted by the small nodes in Figure 2.9, and E is the edge set, in which each edge e_{ij} is the number of shared social ties between tweets y_i and y_j. Our goal is to partition the entire graph Λ into a set of subgraphs P such that connections are strong within each subgraph yet weak across different subgraphs. The modularity of such a partitioning is defined as [72]:

M = \frac{1}{2\sum_i k_i} \sum_{i,j} \left( e_{ij} - \frac{k_i k_j}{\sum_i k_i} \right) p_i p_j, \quad (2.17)

where k_i is the degree of tweet (node) y_i and p_i is the index of the subgraph containing y_i. Partitioning graph Λ is thus equivalent to maximizing M. In fact, M can be rewritten in terms of a modularity matrix B, its eigenvalues β_i, and the corresponding eigenvectors u_i:

Figure 2.9: Example of tweet-tie heterogeneous graph. Big nodes represent social-ties: red nodes are hashtags, blue nodes are mentions, and yellow nodes are retweets. Small nodes denote tweets.

B_{ij} = e_{ij} - \frac{k_i k_j}{\sum_i k_i}, \quad (2.18)

M = \sum_i (u_i^T P)^2 \beta_i. \quad (2.19)

Therefore, maximizing M is approximated by calculating B's largest eigenvalue β_1 and the corresponding eigenvector u_1. In this way, graph Λ is split into two subgraphs based on the signs of the elements of the first eigenvector u_1. This process is repeated until M can no longer be increased by further divisions. Each resulting subgraph corresponds to a tweet subset Y_j of the original tweet set Y (Y = ∪_j Y_j), which is referred to as a mini-tweet-group.
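A one-split illustration of the leading-eigenvector method: build B from Eq. (2.18) and recover the sign pattern of its top eigenvector by shifted power iteration. The `adj` matrix and helper names are hypothetical, and only a single bisection is shown (the paper recurses on each half until M stops improving):

```python
def modularity_split(adj):
    """Bisect a graph by the signs of the modularity matrix's leading
    eigenvector (Eq. 2.18). adj[i][j] = number of shared social ties."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg) or 1
    b = [[adj[i][j] - deg[i] * deg[j] / two_m for j in range(n)]
         for i in range(n)]
    # Shift by a Gershgorin bound so the largest eigenvalue of B dominates
    # in magnitude, then run plain power iteration.
    shift = max(sum(abs(x) for x in row) for row in b)
    v = [1.0 + 0.01 * i for i in range(n)]  # generic start vector
    for _ in range(500):
        w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = max(abs(x) for x in w) or 1.0
        v = [x / norm for x in w]
    return [1 if x >= 0 else -1 for x in v]
```

On a graph of two tightly knit pairs joined by a weak edge, the split separates the pairs.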

Mini-tweet-group Classification Given event-tweet labels in a specific domain p, Z_p = {(x, Y^(x), Ȳ^(x)) | x ∈ X_p}, a classifier is trained based on a support vector machine. An essential step in this training is feature selection. First, rare words that appear less frequently than a threshold value ζ (calculated from historical statistics) are filtered out, unless they are hashtags, mentions, or links. Second, common words such as "love" and "people" should also be removed from the feature set: although these words are frequently mentioned in positive tweets, they are likely to be at least as frequent across the total Twitter space. We therefore define a feature score

\tau_w = \frac{n_w / n}{N_w / N}. \quad (2.20)


In Equation (2.20), n_w and n denote the number of appearances of word w and the total number of words in the positive tweet set, respectively, while N_w and N represent the occurrences of word w and the count of all words in the entire tweet space. The score τ_w of trivial words such as "love" and "people" is therefore no bigger than one. Thus, considering both the frequency threshold and the feature score, the feature set W_F can be denoted as:

W_F = \{w \mid \forall w \in \cup_i Y^{(x_i)},\ \tau_w > 1,\ n_w > \zeta\}. \quad (2.21)

The feature vector π_j of mini-tweet-group Y_j is a |W_F|-dimensional vector, and each element π_{jk} of π_j is defined as:

\pi_{jk} = \begin{cases} 1, & \text{if } w_k \in Y_j,\ w_k \in W_F, \\ 0, & \text{if } w_k \notin Y_j,\ w_k \in W_F. \end{cases} \quad (2.22)

In the training process, social-ties clustering is first applied to the historical labels Z_p. For each event x_i, clustering is conducted separately on the positive set Y^(x_i) and the negative set Ȳ^(x_i). Then, if a mini-tweet-group Y_j lies in the positive example set Y^(x_i), its class indicator s_j is set to 1; if Y_j lies in the negative example set Ȳ^(x_i), then s_j = −1. Our goal in training is to minimize the objective in Equation (2.23) to obtain optimal values for the weight ω [44], where C > 0 is a penalty parameter:

\min_{\omega} \frac{1}{2} \omega^T \omega + C \sum_j \max(0, 1 - s_j \omega^T \pi_j). \quad (2.23)

In the next step, the testing process, the trained classifier is applied to classify mini-tweet-groups from the real-time Twitter data stream Y′. Specifically, a mini-tweet-group Y′_j is predicted to be positive if ω̃^T π_j > 0, and negative otherwise, where ω̃ is the optimal solution of Equation (2.23). Finally, all tweets in the positive class are merged into a domain-related tweet set, denoted by I_p.
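Equation (2.23) is the standard soft-margin linear SVM objective, so an off-the-shelf solver would normally be used; the toy subgradient trainer below is only a sketch of that optimization (feature vectors and labels are illustrative, with s_j ∈ {+1, −1}):

```python
def train_svm(groups, labels, c=1.0, lr=0.01, epochs=500):
    """Subgradient descent on Eq. (2.23): 0.5*||w||^2 + C * sum of hinges.
    groups: binary feature vectors pi_j (Eq. 2.22); labels: +1 or -1."""
    w = [0.0] * len(groups[0])
    for _ in range(epochs):
        for x, s in zip(groups, labels):
            margin = s * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(len(w)):
                # Regularizer gradient plus hinge subgradient when active.
                grad = w[i] - (c * s * x[i] if margin < 1 else 0.0)
                w[i] -= lr * grad
    return w

def classify(w, x):
    """Positive if w^T pi > 0, negative otherwise (as in the testing step)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

On a small separable set, the learned weights recover the intended positive/negative split.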

Event Location Estimation

As discussed above, the tweets in the targeted domain (the event-related tweet set I_p) may contain discussions of several different events. The next step is to apply location estimation technologies to the targeted-domain tweets to distinguish different events happening during the same time period. One tweet may contain multiple location indicators, such as geo-tags generated by GPS, location mentions in the content, and the user's pre-given location in the profile. To make the best use of all this location information, a multinomial spatial-scan method is proposed to detect significant spatial clusters, treating each tweet's location as a multinomial variable. Suppose there are K cities in one country; then the location β̃_y of tweet y can be represented by a K-dimensional vector (β_1, ..., β_K), where ∑_{k=1}^{K} β_k = 1 and β_k ≥ 0. Element β_{yk} in vector β̃_y denotes the probability that tweet y is related to city k. For each tweet y, the location weight vector β̃_y can be computed through the following process.

1. Extract the initial geo-location vector h˜ y. Each element hyi in h˜ y is a longitude-latitude (coordinates) pair (uyi,vyi) converted from the geo-terms contained in the original tweet, such as profile locations, geo-tags, and location mentions. The length of vector h˜ y is decided by the number of geo-terms in the original tweet y.

2. Construct the city-level location vector Gy. For a city k, its spatial scope is represented as pair (Uk,Vk), where Uk is a longitude region (uk1,uk2) and Vk is a latitude region (vk1,vk2). City k covers a geo-term hyi = (uyi,vyi) in h˜ y, if uyi ∈ Uk and vyi ∈ Vk. The value of gyk is therefore decided by the number of geo-terms city k covers.

3. Calculate the location weight vector β̃_y. Given the city-level location vector G_y, element β_{yk} ∈ β̃_y is calculated as β_{yk} = g_{yk} / ∑_{k=1}^{K} g_{yk}.
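Steps 1-3 can be sketched as a small routine that buckets a tweet's coordinate pairs into city bounding boxes and normalizes the counts (the city names and box layout below are invented for illustration):

```python
def location_weights(geo_terms, city_boxes):
    """Steps 1-3: map a tweet's (lon, lat) geo-terms onto city bounding
    boxes and normalize the per-city counts into the weight vector beta.
    city_boxes: {city: ((u1, u2), (v1, v2))} longitude/latitude ranges."""
    counts = {city: 0 for city in city_boxes}
    for u, v in geo_terms:
        for city, ((u1, u2), (v1, v2)) in city_boxes.items():
            if u1 <= u <= u2 and v1 <= v <= v2:
                counts[city] += 1
    total = sum(counts.values())
    return {c: (k / total if total else 0.0) for c, k in counts.items()}
```

A tweet with two geo-terms inside one city and one inside another yields weights 2/3 and 1/3, summing to one as required.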

Given a real-time Twitter stream Y and the event-related tweet set I_p, we can now aggregate the count of event-related tweets at the city level and apply a fast subset scan [78] to identify a set Ω = {L_1, ..., L_H} containing H candidate city clusters, using Kulldorff's statistic [51]:

K_r = (C_A - C_R) \lg \frac{C_A - C_R}{B_A - B_R} + C_R \lg \frac{C_R}{B_R} - C_A \lg \frac{C_A}{B_A}. \quad (2.24)

In Equation (2.24), C_A and B_A refer to the total count and base in the country, respectively, where set A contains all cities in the country. C_A is computed over the event-related tweet set I_p as C_A = ∑_m ∑_{k∈A} β_{mk}, where k is a city in country A and m ranges over the tweets in I_p. Correspondingly, the country-level base B_A is calculated over the Twitter stream Y as B_A = ∑_n ∑_{k∈A} β_{nk}, where n ranges over the tweets in Y. Similarly, C_R and B_R refer to the count and base in the spatial region R, which is a set of neighboring cities: C_R = ∑_m ∑_{k∈R} β_{mk} over the targeted-domain tweet set I_p, and B_R = ∑_n ∑_{k∈R} β_{nk} over the original tweet set Y. To reduce the computational cost, we only consider regions with a count C_R greater than a specified minimum count C_min and a base B_R larger than a specified minimum base B_min.

The above process yields the candidate city cluster set Ω. Randomization testing is then conducted on Ω to obtain the significant cluster subset Ω′ = {L′_1, ..., L′_h} of Ω (h ≤ H). Empirically, parameter H is usually set to be greater than the maximum number of potential clusters that may exist, and the insignificant clusters are filtered out later by the randomization testing. Only those clusters with empirical p-values smaller than a given threshold P_v (e.g., 0.05) are retained in the result subset Ω′. Finally, each element L′_i ∈ Ω′ is converted into an event x′_i, which is the eventual output of the event detection module. Specifically, location cluster L′_i can be represented as a location-tweets pair (R_i, I^(i)), where R_i is a set of neighboring cities and I^(i) is the corresponding tweet set. As the solution to Task 2, the earliest timestamp of the tweets in I^(i) is used as the event date t_{x′_i}, the center coordinates of R_i are extracted as the event location l_{x′_i}, and the tweet set I^(i) is treated as the event-related tweet set I^(x′_i).
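Kulldorff's statistic of Eq. (2.24) is a direct formula once the region and country counts/bases are aggregated; a minimal sketch (log base 10 assumed for `lg`, zero-guarded for degenerate regions):

```python
import math

def kulldorff(c_r, b_r, c_a, b_a):
    """Kulldorff's statistic K_r of Eq. (2.24) for a candidate region with
    count c_r and base b_r, inside a country with totals c_a and b_a."""
    def lg(x):
        return math.log10(x) if x > 0 else 0.0
    return ((c_a - c_r) * lg((c_a - c_r) / (b_a - b_r))
            + c_r * lg(c_r / b_r)
            - c_a * lg(c_a / b_a))
```

A region whose count is exactly proportional to its base scores zero, while an over-represented region scores strictly higher, which is what the spatial scan ranks on.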

2.2.6 Results

In this section, we first introduce the datasets used for evaluation, and then compare ATSED with five existing algorithms. Next, the effectiveness of each component in ATSED is validated. Finally, two case studies from ATSED output are discussed. All experiments were performed on a computer with one 3.20 GHz Intel Xeon CPU and 18.0 GB RAM.

Datasets and evaluation metrics

Two datasets are used in our experiments: a Twitter dataset and a GSR dataset. Both consist of data from July 2012 to May 2013 for 10 countries in Latin America. These datasets were separated into two parts: 1) data from July 2012 to December 2012 were utilized as the label generation data source for ATSED and as the training set for the supervised comparison methods, and 2) data from January 2013 to May 2013 were used as the testing set for validating all the methods. The Twitter dataset was collected through the Twitter API 5. Tweets' contents were stemmed and stop-words were removed. Location terms were extracted from the original Twitter data, including GPS geo-tags, location mentions, and user profile locations. The Twitter locations used in the label generation module are inferred locations, prioritized as: location mentions > GPS geo-tags > user profile locations. The spatiotemporal event detection module, in contrast, can use all of this location information to estimate the locations of detected events. In total, 305 million tweets were collected. Detection results were validated against a labeled event set named the "Gold Standard Report" (GSR) 6. Each GSR event consists of a date, a location, and corresponding news reports. A real-world event was selected as a GSR event if it was reported by local news outlets or by influential international media. Table 2.1 lists detailed information about the events in each country. The results of all methods were validated against GSR events. A detected event is regarded as "matching" a GSR event if it satisfies the following two conditions: 1) the detected event time is the same as that recorded in GSR; and 2) the detected event location is within the same city as that recorded in GSR. Generally, two types of metrics are used in our evaluation: relevance and timeliness metrics.

5https://dev.twitter.com/rest/public
6http://www.mitre.org/
7In addition to the domestic Top 3 news outlets, the following global news outlets are also included: The New York Times, The Guardian, The Wall Street Journal, The Washington Post, The International Herald Tribune, The Times of London, Infolatam.

Country     | News source 7                                      | #Training Events | #Testing Events
Argentina   | Clarín; La Nación; Infobae                         | 365              | 318
Brazil      | O Globo; O Estado de São Paulo; Jornal do Brasil   | 451              | 361
Chile       | La Tercera; Las Últimas Notícias; El Mercurio      | 252              | 229
Colombia    | El Espectador; El Tiempo; El Colombiano            | 298              | 213
Ecuador     | El Universo; El Comercio; Hoy                      | 275              | 123
El Salvador | El Diáro de Hoy; La Prensa Gráfica; El Mundo       | 180              | 127
Mexico      | La Jornada; Reforma; Milenio                       | 1217             | 811
Paraguay    | ABC Color; Ultima Hora; La Nacíon                  | 563              | 387
Uruguay     | El País; El Observador                             | 124              | 104
Venezuela   | El Universal; El Nacional; Ultimas Notícias        | 678              | 557

Table 2.1: Distribution of events in 10 Latin American countries. "News source" shows the news agencies utilized as sources for the GSR dataset.

Specifically, the relevance metrics are precision, recall, and F-score: "precision" quantifies the fraction of detected events that match GSR events, "recall" quantifies the percentage of GSR events that are correctly detected, and "F-score" is the harmonic mean of precision and recall. The timeliness metric "lead time" measures the delay between the event time reported by a Twitter event detection method and the earliest publication date in the news media. A positive "lead time" means the detected event surfaced earlier than the news, while a negative value means the event was first reported by the news media rather than Twitter streams.
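To make the matching rule and metrics concrete, the following sketch computes precision, recall, F-score, and lead time for toy event lists. The (date, city) tuple representation and exact city-string matching are simplifications for illustration, not the actual pipeline's data structures.

```python
from datetime import date

def relevance_metrics(detected, gsr):
    """Precision/recall/F-score under the matching rule described above:
    a detected event matches a GSR event when the dates are equal and
    the city labels agree. Events are (date, city) tuples (assumed)."""
    matched_gsr = set()
    matched_det = 0
    for d_day, d_city in detected:
        for i, (g_day, g_city) in enumerate(gsr):
            if d_day == g_day and d_city == g_city:
                matched_det += 1
                matched_gsr.add(i)
                break
    precision = matched_det / len(detected) if detected else 0.0
    recall = len(matched_gsr) / len(gsr) if gsr else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

def lead_time_days(detected_day, news_day):
    """Positive when detection precedes the earliest news report."""
    return (news_day - detected_day).days

# Toy example: one of two detections matches one of two GSR events.
p, r, f = relevance_metrics(
    [(date(2013, 1, 5), "Bogota"), (date(2013, 1, 6), "Lima")],
    [(date(2013, 1, 5), "Bogota"), (date(2013, 1, 7), "Quito")],
)
```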

Table 2.2: Spatial performance comparison among Twitter event detection methods (Precision, Recall, F-score). Numbers in bold show the best F-score values in corresponding countries.

Dataset     | ATSED            | Graph Partition  | Earthquake       | Topic Modeling   | TEDAS            | ST Burst
Brazil      | 0.48, 0.85, 0.61 | 0.55, 0.34, 0.42 | 0.65, 0.19, 0.30 | 0.46, 0.09, 0.15 | 0.39, 0.20, 0.27 | 0.80, 0.45, 0.58
Colombia    | 0.80, 0.92, 0.86 | 0.68, 0.29, 0.41 | 0.55, 0.49, 0.52 | 0.26, 0.39, 0.31 | 0.66, 0.41, 0.50 | 0.87, 0.48, 0.62
Uruguay     | 0.53, 0.34, 0.41 | 0.28, 0.23, 0.25 | 0.86, 0.11, 0.20 | 0.22, 0.06, 0.09 | 0.88, 0.56, 0.68 | 0.11, 0.06, 0.08
El Salvador | 0.64, 0.62, 0.63 | 0.35, 0.07, 0.1  | 0.32, 0.06, 0.10 | 0.40, 0.05, 0.09 | 0.71, 0.36, 0.48 | 0.30, 0.12, 0.17
Mexico      | 0.69, 0.86, 0.77 | 0.72, 0.23, 0.35 | 0.51, 0.19, 0.28 | 0.34, 0.08, 0.12 | 0.56, 0.20, 0.29 | 0.76, 0.43, 0.55
Chile       | 0.64, 0.77, 0.70 | 0.83, 0.39, 0.53 | 0.46, 0.19, 0.27 | 0.42, 0.48, 0.45 | 0.96, 0.36, 0.53 | 0.67, 0.69, 0.68
Paraguay    | 0.50, 0.85, 0.63 | 0.76, 0.19, 0.30 | 0.40, 0.10, 0.16 | 0.86, 0.07, 0.13 | 0.88, 0.67, 0.76 | 0.34, 0.12, 0.18
Argentina   | 0.57, 0.78, 0.66 | 0.88, 0.14, 0.24 | 0.63, 0.57, 0.60 | 0.38, 0.42, 0.40 | 0.51, 0.64, 0.57 | 0.63, 0.73, 0.67
Venezuela   | 0.87, 0.86, 0.87 | 0.46, 0.21, 0.29 | 0.87, 0.22, 0.35 | 0.47, 0.37, 0.41 | 0.79, 0.28, 0.42 | 0.82, 0.33, 0.47
Ecuador     | 0.74, 0.38, 0.50 | 0.30, 0.22, 0.25 | 0.78, 0.60, 0.68 | 0.67, 0.04, 0.08 | 0.55, 0.92, 0.69 | 0.29, 0.26, 0.27

Methods for Comparison

We compared ATSED with five popular event detection methods: two supervised algorithms, Earthquake Detection [98] and TEDAS [64], and three unsupervised methods, Topic Modeling [123], Graph Partition [118], and Spatiotemporal Burst [57]. The detailed experimental settings for these methods were as follows.

• Earthquake Detection [98]: This work designed an SVM classifier to distinguish earthquake-related tweets for event detection. Three features are described in the paper for classifier training: statistical, keyword, and word context. All three features were tested in our evaluation, and the keyword feature was chosen for its best performance (measured by F-score).

• TEDAS [64]: TEDAS is another supervised event detection system based on SVM. There are two pairs of tunable parameters, (α, β) and (α0, β0), which are priors that penalize words with low frequencies. The recommended settings β = β0 = 10 provided by the authors were used in our experiments. Due to the low percentage of civil unrest content, α and α0 were assigned a small value, 0.1, to capture the sparse data.

• Topic Modeling [123]: The implementation code applied here was provided by the authors. Hashtags were treated as tags, and tweet geotags were taken as the corresponding geographic regions.

• Graph Partition [118]: The authors employed the MAD algorithm [61] to deal with the skewness of the signal strength distribution. In our experiment, various settings for the MAD threshold (1, 5, 10, 20, 30, 40) were evaluated, and a value of 20 was chosen as it produced the best performance.

• Spatiotemporal Burst [57]: The implementation code was provided by the authors 8. For our experiment, domain words were used as the input queries for the spatiotemporal search engine. The tunable temporal window size was set to 6, as recommended in the original work. We also evaluated other values, including 12 and 24, but observed similar results.

We created a manual label set, which was used as training data for the two supervised comparison methods (Earthquake and TEDAS). Tweets that were definitely related to "civil unrest" were selected as positive examples, for instance "With protests in the Zocalo, #YoSoyCan26 requires Iztapalapa dogs to be free", while tweets containing some keywords but definitely irrelevant to "civil unrest" were deemed negative, such as "Measures should be taken to protest trees against winter damage". To strengthen the quality of the training data, each tweet was assigned to three different annotators. In total, 11,533 tweets were collected for training, of which about 46% were "civil unrest related" (positive examples) and 54% were unrelated (negative examples).

8http://www.cs.ucr.edu/ tlappas/scripts/STBurst.rar

All the comparison methods and baselines returned event-related tweet content, time, and location. However, in addition to the targeted "civil unrest" events, Topic Modeling and Graph Partition also returned events on other topics. To ensure a fair comparison, an SVM classifier trained on the manual label set was adopted to identify "civil unrest" events within the general event sets.
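A minimal sketch of such an SVM-based content filter, using scikit-learn's TF-IDF vectorizer and linear SVM. The toy tweets and labels below are illustrative stand-ins, not the 11,533-tweet manual label set described above.

```python
# Sketch of an SVM filter for "civil unrest" tweets (toy training data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "protest march against education reform in the square",   # unrest
    "demonstrators demand release of the dogs",                # unrest
    "your charger does not work anymore bring it to us",       # unrelated
    "love hurts friends leave but life goes on",               # unrelated
]
train_labels = [1, 1, 0, 0]  # 1 = civil-unrest related, 0 = unrelated

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

label = clf.predict(["students march to protest the education reform"])[0]
```

With real data, the classifier would of course be trained on the full annotated set rather than four toy sentences.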

Parameter settings

This section gives the settings of all the parameters used in ATSED system.

• Domain weight threshold ηc. In the feature extraction module, the threshold αc in Equation (2.7) defines the score boundary ηc between important domain words and trivial ones. As suggested by Leys et al. [61], the value of αc can be set to 1/Q(0.75), where Q(0.75) is the 0.75 quantile of the distribution. To maintain a balance between word importance and quantity, αc was set to 3.97 (ηc = 0.087), which returned a medium-size domain word set of 52 words. The event weight threshold δe can be set in a similar way.

• Temporal coefficient λ. As introduced in the relevancy ranking module, the Poisson parameter λ has a significant impact on temporal similarity. Figure 2.10 illustrates the fitting process for λ. The x-axis denotes the temporal distance between tweet and event, where "0" means the tweet publication date and the event occurrence date fall on the same day. The y-axis shows the daily number of event-related tweets, normalized by their sum. To estimate λ, 500 events were sampled and fitted to an exponential distribution. On average, λ = 0.48 with R² = 0.81 was chosen as the default setting in our experiments.

• Gaussian mixture coefficient Q. As illustrated in Figure 2.11, there is a trade-off between the average relativity score and the positive set size. The left y-axis denotes the proportion of positive tweets; the right y-axis is the average event-tweet relativity score of the positive tweets. A larger value of Q produces a smaller positive set containing tweets with higher relativity scores. Conversely, a smaller value of Q admits more tweets into the positive set, at the cost of a lower average relativity score. To balance the quantity and quality of positive tweets, we set Q to 4 in this paper, which is the value closest to the intersection point of the two curves.

• Word frequency threshold ζ. As with the domain weight threshold ηc, the MAD method [61] is used to calculate the value of ζ. Following the principle suggested in [61], ζ is set to 93 to filter out trivial words.

• Parameters in the location estimation module. There are three tunable parameters that may affect the final performance of location estimation: minimal count Cmin, minimal base Bmin, and p-value Pv. No obvious differences are observed when the p-value Pv is changed from 0.01 to 0.1 or the minimal base Bmin from 10 to 50. The key parameter affecting final performance is the minimal count Cmin, which is discussed in the section "Evaluation of the Extended Spatial Scan".
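The MAD-based cutoffs above can be sketched as follows. The multiplier k and the constant 1.4826 (the value of 1/Q(0.75) under normality) are illustrative assumptions; ATSED instead fits 1/Q(0.75) to its own score distribution, obtaining αc = 3.97.

```python
import statistics

def mad_threshold(values, k=3.0):
    """MAD-based cutoff in the spirit of Leys et al. [61]: flag values
    more than k scaled-MADs above the median. A generic sketch, not
    ATSED's exact setting."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    b = 1.4826  # consistency constant assuming normality (assumption)
    return med + k * b * mad

# Toy word-weight distribution: two words stand far above the bulk.
weights = [0.01, 0.012, 0.009, 0.011, 0.010, 0.35, 0.28]
cut = mad_threshold(weights)
domain_word_idx = [i for i, w in enumerate(weights) if w > cut]
```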

Figure 2.10: Temporal decay pattern of event-tweets. Blue nodes are actual values, while the red line denotes the fitted model.

Figure 2.11: Trade-off between average relativity score and positive set size for parameter Q.
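The λ-fitting procedure can be illustrated with a log-linear least-squares fit of tweet volume against temporal distance. The daily volume ratios below are synthetic stand-ins for the 500-event sample, chosen to decay roughly as exp(-0.48 d).

```python
import math

# Synthetic daily tweet-volume ratios at temporal distances 0..4 days
# (illustrative values, not the dissertation's actual data).
dist = [0, 1, 2, 3, 4]
ratio = [0.30, 0.19, 0.12, 0.07, 0.045]

# Fit ratio ≈ A * exp(-lambda * d) via least squares on log(ratio):
# log(ratio) is linear in d with slope -lambda.
logs = [math.log(r) for r in ratio]
n = len(dist)
mx = sum(dist) / n
my = sum(logs) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(dist, logs))
         / sum((x - mx) ** 2 for x in dist))
lam = -slope  # estimated temporal decay coefficient
```

On these synthetic values the fit recovers λ ≈ 0.48, matching the default reported above.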

Performance Analysis

In this part, we first compared the overall performance of ATSED with the five previous methods, and then separately evaluated the effectiveness of each component in ATSED.

Overall Relevance Evaluation

ATSED was compared with the five existing methods, and the results are listed in Table 2.2. ATSED achieved the best overall performance, obtaining the highest F-score in 7 out of the 10 countries. TEDAS was the second best method, achieving the highest F-score in the remaining 3 countries. The performance of the supervised learning method Earthquake was comparable to that of ATSED in precision, but failed to match ATSED in recall and F-score. Spatial Temporal Burst performed relatively well in large countries such as Brazil, but poorly in small countries like Uruguay. Graph Partition and Topic Modeling yielded the worst overall results, which suggests that, even with SVM-based content filtering, unsupervised methods designed for detecting general topics are still insufficient for detecting events in targeted domains. In general, the two supervised methods (Earthquake and TEDAS) and ATSED performed better than any of the unsupervised methods (Graph Partition, Topic Modeling, and Spatial Temporal Burst). For further analysis, Figure 2.12 compares the temporal performance of the two supervised methods (Earthquake, TEDAS) and ATSED. Three observations were made based on the data reported in Table 2.2 and Figure 2.12.

(a) Precision (b) Recall (c) F1 Score

Figure 2.12: Temporal performance comparison of ATSED, Earthquake, and TEDAS.

1. Overall performance. Both Table 2.2 (spatial comparison) and Figure 2.12 (temporal comparison) indicate that ATSED, a semi-supervised approach, was able to achieve precision comparable to that of the supervised systems using manual labels, and outperformed them with much better recall and F-score.

2. Spatial performance. ATSED performed stably across different countries, while Earthquake and TEDAS were clearly unstable across countries. Although TEDAS worked better than ATSED in small countries such as Paraguay and Uruguay, it fell short in large countries like Mexico and Venezuela, which generate more than 32% of the total Twitter data in Latin America.

3. Temporal performance. ATSED also yielded stable temporal performance, while Earthquake and TEDAS fluctuated over different time periods. For the February data, Earthquake and TEDAS both suffered sharp decreases in recall and F-score, but ATSED maintained good performance on all three metrics.

In summary, ATSED outperformed all of the other methods in both effectiveness and robustness, clearly demonstrating its ability to yield better results and work more stably across various countries and time periods. Several reasons may account for ATSED's strong performance. First, the automatically generated labels contribute to the superior overall performance, as they enable ATSED to obtain a large number of high-quality labels for countries with different languages, whereas collecting sufficient labels of equivalent diversity manually would be very difficult. Second, going beyond a traditional text-based classifier, the classifier incorporated in ATSED benefits the final results by taking into account the social ties among tweets. Finally, the extended spatial scan further enhances ATSED's output by improving the quality of the location data. In the following sections, we evaluate the effect of each component of ATSED separately.

Timeliness Evaluation

To evaluate how soon newly emerging events can be detected, Figure 2.13 compares the timeliness metric "lead time" among the three best performers: Earthquake, TEDAS, and our proposed ATSED. Overall, ATSED achieves the best "lead time" of 2.42 days, TEDAS is second best at 2.34 days ahead of news reports, and Earthquake performs worst with an overall "lead time" of 2.04 days.

1. Twitter comes earlier than news. "Civil unrest" events generally appear first on Twitter: even the worst performer, Earthquake, can detect events 2.04 days prior to the news report. This is because 75% of "civil unrest" events are planned in advance [77], and social media such as Twitter play a key role in organizing protests, especially in the early stages9. Detecting events from Twitter can thus provide "beforehand information" about civil unrest, while traditional news media only produce "morning-after" reports.

2. Organized protests come earlier than spontaneous protests. Note that ATSED obtained better "lead time" in countries such as Uruguay and Argentina than in Brazil. We studied the Brazilian protests and found that they were more spontaneous than those in other countries: for instance, the initial protests were triggered by bus fares and soon developed into protests against the government, most of which were not organized.

Figure 2.13: Lead time comparison of ATSED, Earthquake, and TEDAS.

Evaluation of Label Generation

9https://goo.gl/8wfhkN

Tweets by baseline:
1. Northern Ireland live another march day: Demonstrators protest since December by a decree ... http://t.co/O2K9hMIq
2. #EnImágenes Students protest in several states against the judgment of the Supreme Court http://t.co/clj5XraS
3. RT @FilosofiaTipica: People change. Love hurts. Friends leave. Things sometimes go wrong. But remember that life goes on.

Positive tweets by ATSED:
1. With protests in the Zocalo, #YoSoyCan26 requires government to free dogs of Iztapalapa. http://t.co/XPsQ90po #AMLO
2. #YoSoyCan26 march in solidarity with Socket for victims' families in Cerro de la Estrella and demand liberty for dogs.
3. RT @politicosmex: To people of Mexico, dogs are murderers is incredulous: Government of the capital is asked to clarify the truth ... http://t.co/m5UbmJXT

Negative tweets by ATSED:
1. According to reports from the authorities in Iztapalapa, six people are murdered in three offices... http://t.co/qvhCsEhl
2. Your charger does not work anymore? You have a broken dog? Bring it to us in Tampico Altamira tree http://t.co/fSOj8U2C
3. RT @CristhianH23: Dogs killers, untouchable aliens, fair elections, less unemployment, peaceful marches, united people #Mé

Table 2.3: Sample tweets for the baseline method and ATSED. Domain words are denoted in bold and event words are underlined. The tweets, originally in Spanish, have been translated into English using Google Translate.

The effectiveness of the automatic label generation (ALG) component is demonstrated by the high quality of the tweet labels. The above-mentioned "dog protest" in Mexico was taken as the case study here, as it was a small-scale protest that would normally be hard to identify. The top 3 ranked example labels generated by ATSED are listed in Table 2.3. For comparison, the top-ranked tweets retrieved by the keyword matching method [37] are also listed in the table, using the words most relevant to "civil unrest", such as "protest" and "march". As the results in Table 2.3 show, tweets obtained through the keyword matching baseline contain the following types of noise.

1. Tweets irrelevant to the targeted domain. Some tweets were completely unrelated to the topic "civil unrest". Consider Tweet #3, for example. Its original Spanish text was: "La gente cambia. El amor duele. Los amigos se marchan. Las cosas a veces van mal. Pero recuerda que la vida sigue" ("People change. Love hurts. Friends leave. Things sometimes go wrong. But remember that life goes on."). Although this did contain one civil unrest keyword, "marchan" (which becomes "march" after stemming), the tweet was in fact about people's feelings rather than "civil unrest" events.

2. Tweets irrelevant to the specific event. Among the tweets that were indeed related to "civil unrest", most reflected influential protests that occurred in countries other than Mexico. For example, Tweet #1 was actually about a protest in Northern Ireland, and Tweet #2 mentioned a protest that happened in Venezuela. Small events such as the "dog protest" were submerged by these "big events".

In contrast, the positive tweets retrieved by ATSED were highly related to the "dog protest" event. These tweets can be summarized into two types.

1. Tweets referring to the protest itself. For example, Tweets #1 and #2 contained highly ranked "civil unrest" domain words, such as "protesta" (protest) and "marcha" (march), as well as important event words, for example "perros" (dogs) and "Iztapalapa" (a location name).

2. Tweets related to events that triggered the protest. The reason for the protest was not mentioned in the news report, but can be revealed from Tweet #3: citizens were protesting to gain the freedom of innocent dogs that had been captured by government officials as suspects in the killing of 4 people. Besides the event words, these tweets also contained middle-ranked domain words such as "Gobierno" (government) and "México", which were weak indicators of "civil unrest" when appearing alone, but became stronger when they co-occurred in the same tweet.

In addition, as shown in Table 2.3, ATSED also provided negative examples, which can be generally divided into the following three types.

1. Low textual score tweets. For example, the domain words ("authorities" and "people") and event word ("Iztapalapa") contained in Tweet #1 are low-weight words and result in a poor textual score.

2. Low spatial score tweets. For instance, Tweet #2 had a relatively high textual score, as it contained the strong event word "perro" and the domain word "Tráelo". However, its spatial score was low because the location it provided was the city of Tampico, which is about 500 kilometers away from the event location (Mexico City).

3. Low temporal score tweets. Tweet #3 had a strong textual score, as it contained both "dogs" and "marches", but a weak temporal score, as it was published on Jan 19, one week after the event date.
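The interplay of the three scores can be sketched with a simple combined relativity function. The exponential temporal decay uses the fitted λ = 0.48 from the parameter settings; the multiplicative combination and the spatial distance scale d0 are assumptions for illustration, not ATSED's exact formula.

```python
import math

def tweet_event_relativity(textual, dist_km, dt_days, lam=0.48, d0=100.0):
    """Illustrative tweet-event relativity score combining textual,
    spatial, and temporal components. lam = 0.48 follows the fitted
    temporal coefficient above; the multiplicative form and distance
    scale d0 are hypothetical choices for this sketch."""
    temporal = math.exp(-lam * dt_days)     # decays with days elapsed
    spatial = math.exp(-dist_km / d0)       # decays with distance
    return textual * spatial * temporal

# A same-day tweet near the event outranks a week-late, faraway one,
# even at equal textual score (cf. the negative examples above).
near_now = tweet_event_relativity(textual=0.9, dist_km=5, dt_days=0)
far_late = tweet_event_relativity(textual=0.9, dist_km=500, dt_days=7)
```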

“Precision@K” is used to quantitatively evaluate the quality of the generated labels. It is calculated as the fraction of the top K ranked tweets that are truly relevant to the targeted domain “civil unrest”.

Precision@K = |D_T ∩ D_topK| / K    (2.25)

where D_T is the ground-truth set of positive labels from the manual label set mentioned in Section 2.2.6, and D_topK is the set of the top K tweets ranked by each method. Specifically, we selected a mixed label set consisting of 1,000 positive tweets and 5,000 negative tweets, and ranked these tweets through random selection, keyword matching, and our proposed ATSED. The results listed in Table 2.4 show that labels generated by ATSED outperform the other methods at almost all values of K. ATSED beats the other methods because it can assign weights to words using knowledge learned from news. The output of the keyword matching method is acceptable when K is small (e.g., K = 50); however, its performance drops quickly as K increases and converges toward that of random selection.

Table 2.4: Labels quality evaluation through “Precision@K”

Method           | P@50 | P@100 | P@150 | P@200 | P@250 | P@300 | P@350
Random selection | 0.18 | 0.19  | 0.16  | 0.17  | 0.17  | 0.15  | 0.15
Keyword matching | 0.63 | 0.46  | 0.32  | 0.25  | 0.22  | 0.19  | 0.18
ATSED            | 0.84 | 0.79  | 0.77  | 0.74  | 0.73  | 0.74  | 0.71
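Equation (2.25) is straightforward to compute once a ranking is available; a minimal sketch with hypothetical tweet IDs:

```python
def precision_at_k(ranked_ids, positive_ids, k):
    """Precision@K as in Equation (2.25): the fraction of the top-K
    ranked tweets that belong to the ground-truth positive set D_T."""
    return sum(1 for t in ranked_ids[:k] if t in positive_ids) / k

# Toy ranking with hypothetical IDs: t1, t3, t6 are true positives.
ranked = ["t1", "t2", "t3", "t4", "t5", "t6"]
positives = {"t1", "t3", "t6"}
p_at_2 = precision_at_k(ranked, positives, 2)
```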

Evaluation of the Tweet Classifier

ATSED's tweet classifier was compared with those of the two supervised methods, Earthquake and TEDAS. To ensure a fair comparison of the tweet classifier components, labels generated by ATSED were used as training data for both Earthquake and TEDAS. Given the same training dataset, any differences among the three methods depend mainly on the design of the Twitter text classifier. Table 2.5 compares the performance achieved by each of the three methods. The data in the table reveal that:

Table 2.5: Performance comparison for Twitter text classifiers (Precision, Recall, F-score). Upward arrows denote performance improvements over the original results shown in Table 2.2. Numbers in bold show the best F-score values for each country.

Dataset     | ATSED            | Earthquake        | TEDAS
Brazil      | 0.48, 0.85, 0.61 | 0.39, 0.28, 0.32↑ | 0.70, 0.53, 0.60↑
Colombia    | 0.80, 0.92, 0.86 | 0.29, 0.41, 0.34  | 0.72, 0.51, 0.60↑
Uruguay     | 0.53, 0.34, 0.41 | 0.52, 0.25, 0.38↑ | 0.27, 0.44, 0.33
El Salvador | 0.64, 0.62, 0.63 | 0.45, 0.09, 0.16↑ | 0.52, 0.58, 0.55↑
Mexico      | 0.69, 0.86, 0.77 | 0.62, 0.36, 0.46↑ | 0.77, 0.55, 0.64↑
Chile       | 0.64, 0.77, 0.70 | 0.69, 0.71, 0.70↑ | 0.71, 0.50, 0.59↑
Paraguay    | 0.50, 0.85, 0.63 | 0.46, 0.39, 0.42↑ | 0.49, 0.79, 0.60
Argentina   | 0.57, 0.78, 0.66 | 0.58, 0.66, 0.62↑ | 0.42, 0.74, 0.53
Venezuela   | 0.87, 0.86, 0.87 | 0.51, 0.42, 0.46↑ | 0.80, 0.45, 0.58↑
Ecuador     | 0.74, 0.38, 0.50 | 0.16, 0.44, 0.23  | 0.16, 0.52, 0.25

1. Using labels generated by ATSED improved detection performance for both Earthquake and TEDAS. Comparing Table 2.2 and Table 2.5 reveals that these two methods exhibited obvious increases in recall and F-score in most countries, accompanied by slight decreases in precision. With respect to the F-score, Earthquake performed better than before in 8 countries, and TEDAS achieved gains in 6 countries. Compared to human analysts, ATSED inevitably produced some noisy labels, which may have been responsible for the small reduction in precision. However, ATSED can easily generate a large number of relevant labels, which boosts both recall and F-score. Creating a manual label set of equivalent size would be extremely expensive and time-consuming.

2. When using the same training data, ATSED outperformed both Earthquake and TEDAS in all ten countries. This observation strongly indicates the effectiveness of our proposed Twitter classifier. Without considering tweets' distinct features (e.g., hashtags, mentions), Earthquake turned in the worst performance. While both TEDAS and ATSED took Twitter terms into account as additional features for the SVM classifier, ATSED first clustered tweets based on social ties, which increased the efficiency and precision of the ensuing classification.

Evaluation of the Extended Spatial Scan

The new multinomial spatial scan model was also compared with the original spatial scan method [78]. Three tunable parameters are shared by the two methods: the cut-off threshold p-value Pv, the minimal count number Cmin, and the minimal base number Bmin. No significant difference was observed between the two models when adjusting either the p-value Pv or the minimal base number Bmin. To obtain the best performance, we set Pv = 0.05 and Bmin = 20 for both methods. However, ATSED was sensitive to the minimal count number Cmin. Figure 2.14 plots the precision and recall of the two methods as the minimal count number Cmin is changed from 2 to 10.

Figure 2.14: Comparison of multinomial and original spatial scan performance.

We can therefore make the following observations.

1. In both methods, recall decreased with increasing Cmin. For all Cmin values, our proposed multinomial spatial scan always achieved better recall than the original model. As Cmin increased from 2 to 10, the distance between the two recall curves changed little.

2. Increasing Cmin led to an increase in precision. At the starting point (Cmin = 2), the original model had a better precision score. However, our proposed multinomial model improved at a much greater rate than the original spatial scan. Therefore, as Cmin increased, the advantage of the original model narrowed and finally disappeared.

3. After Cmin reached 6, both methods became stable and no further changes were observed. In the stable state, with precision close to 1, our multinomial spatial scan model still maintained recall above 0.5, while the original spatial scan only achieved 0.4.

In general, the multinomial spatial scan contributed better recall with little loss in precision. On recall, our extended spatial scan consistently provided a clear advantage over the original spatial scan. As for precision, the multinomial spatial scan yielded precision comparable to that of the original spatial scan when Cmin < 6, and achieved the same precision when Cmin ≥ 6.
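For intuition, a multinomial scan statistic scores a candidate spatial region by the log-likelihood ratio of its category distribution (here, e.g., counts per location-evidence type) against the overall distribution; the region maximizing this ratio is the detected cluster. The sketch below shows a standard generic form of this statistic, not the exact extended version used by ATSED.

```python
import math

def multinomial_scan_llr(inside, totals):
    """Log-likelihood ratio of a candidate region for multinomial data.
    inside[k]: count of category k inside the region; totals[k]: count
    of category k overall. Generic form (0*log(0) taken as 0); a sketch,
    not ATSED's exact extended statistic."""
    def xlogy(x, y):
        return x * math.log(y) if x > 0 else 0.0
    n_in = sum(inside)
    n_tot = sum(totals)
    n_out = n_tot - n_in
    llr = 0.0
    for c_in, c_tot in zip(inside, totals):
        c_out = c_tot - c_in
        llr += xlogy(c_in, c_in / n_in)      # inside-region likelihood
        llr += xlogy(c_out, c_out / n_out)   # outside-region likelihood
        llr -= xlogy(c_tot, c_tot / n_tot)   # null (one distribution)
    return llr

# A region whose category mix matches the overall mix scores 0;
# a skewed region scores strictly higher.
uniform_score = multinomial_scan_llr([2, 2], [4, 4])
skewed_score = multinomial_scan_llr([4, 0], [4, 4])
```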

Figure 2.15: Case study on spatial factors of ATSED event detection results. The three detected events correspond to GSR events on Jan 20, 2013: a vehicle ownership tax protest in Hermosillo, a demand for parking on public roads in Mexico City, and a rejection of education reform in Oaxaca.

Figure 2.16: Case study on temporal factors of ATSED event detection results: daily tweet volumes for the hashtags #CNTE, #SNTE, #CETEG, and #YoSoyCan26 from January to June 2013.

Case Study

Several interesting patterns were observed in the ATSED output. Figure 2.15 describes three events detected by ATSED on Jan 20, 2013 in Mexico. In the figure, each detected event is represented by a location point (red circle), a summary word cloud, and a corresponding ground-truth GSR description. Although these events happened simultaneously in the same country, ATSED successfully distinguished all three and captured their different social focuses. As shown in the word clouds, the Hermosillo protesters were demanding a reduction in their "vehicle tax", the event in Mexico City was mainly about a "parking" issue, and teachers in Oaxaca were marching to protest against "education reform". The cases in Figure 2.15 reveal that ATSED can identify events at the city level, while most previous Twitter event detection technologies can only detect events at the country level. Figure 2.16 plots the trends of 3 popular hashtags found in the detected tweets from the ATSED output. All 3 hashtags were related to "teacher" protests: "#SNTE" was the hottest topic among the "civil unrest" tweets at the beginning of March, but "#CETEG" and "#CNTE" became more popular from April onwards. These patterns were driven by several interesting facts. The head of the National Union of Education Workers (SNTE) was arrested for corruption on Feb 28. The scandal stimulated protests against SNTE in the following month and resulted in the popularity of "#SNTE" in March. As SNTE suffered from the negative impact of the corruption scandal, other teacher organizations emerged rapidly, and after April "#SNTE" almost disappeared from the tweet data, replaced by the hashtags of two other teacher unions, the Guerrero State Coordinator of Education Workers (CETEG) and the National Education Workers Coordinator (CNTE).
Traditional methods require extensive human effort to manually relabel training data in order to keep up with events on the ground; ATSED, in contrast, is capable of updating its training dataset periodically. The trends in these three hashtags demonstrate ATSED's ability to capture the dynamics of Twitter data.

2.2.7 Conclusion

This paper presented a model named ATSED to detect spatiotemporal events of targeted domains from Twitter streams. Beyond the civil unrest events studied in this paper, ATSED can also handle spatiotemporal events in other targeted domains (e.g., sports, politics, environment). Previous Twitter event detection methods usually require manually labeled data for training; in contrast, ATSED can generate high-quality label data automatically. Based on these labels, an SVM-based classifier utilizing Twitter social ties is trained and applied to real-time Twitter streams to recognize event-related tweets. To enhance the accuracy of event location estimation, all forms of Twitter location information are considered in the multinomial spatial scan component of ATSED. The experimental results have shown that ATSED effectively improved detection performance compared to existing Twitter event detection approaches, and further evaluation demonstrated that each part of ATSED contributes positively to the overall performance.

Chapter 3

Underlying Factors behind Social Media and News

3.1 Analyzing Civil Unrest through Social Media

3.1.1 Introduction

Civil unrests, such as the recent Indonesia worker strikes over pay demands 1 and teachers protest education reforms in Mexico 2, are among the key factors affecting the stabilities of nations. A holistic understanding of civil unrests especially as they relate to the politics, economics, and social relationships between governments and their citizens is an important subject in political science research. One approach to characterizing civil unrest relies on understanding the leading indicators, causal factors, triggers, or other contributing factors prior to the unrests. Newspapers are the most widely used materials for such analysis. However, as shown in Figure 3.1, news reports about on-going protests primarily state the basic event information, such as the date (Jan 12th), location (Zocalo in Mexico City), and participants (150 people), but lack timely and sufficient details that would help reveal the related triggers, the political entrepreneurs, and political organizations. Triggers typically refer to an action committed by the government (e.g., passing legislation; police brutality, etc.) or any third party (e.g., criminal gang activity) or to a natural event causing human suffering (e.g., severe hurricane, major earthquake, etc.). These events are not produced by peo- ple participating in the civil disruption that we are trying to explain, but occur prior to the civil disruption and may or may not be causally connected to the disruption. A political entrepreneur is someone who articulates a call for action in a manner that resonates

1http://goo.gl/QlgFk8 2http:// goo.gl/FM8nM3

47 48

"No animal abuse": protest to liberate the Iztapalapa dogs. Local • January 12, 2013 • Esthela Adriana Flores and Jorge Becerril. About 150 people gathered in the Zocalo and demanded the release of the dogs caught in the area of Cerro de la Estrella.

Mexico City • Accompanied by a dozen dogs, about 150 people of the #YoSoyCan26 movement marched around the Zocalo of Mexico City to insist on freedom for the 57 dogs captured in connection with the homicides in the Cerro de la Estrella.

Shouting "Animal friends are not criminals!" and "Canines are not murderers!", the demonstrators held up signs alongside the fences surrounding the square.

They argued that canines have no voice, but that they would defend them against animal abuse. They also demanded that the true murderers who killed five people in the Cerro de la Estrella be brought to jail.

Figure 3.1: News article from Milenio, a major Mexican newspaper, about a protest in Mexico City calling for the release of captured wild dogs alleged to have attacked and killed citizens. Like most such articles, it includes basic facts such as the date of the protest (12 January 2013), its location (El Zocalo, the city's main public square), and the number of participants (150), but provides little insight into the incident's underlying causes. The article, originally in Spanish, has been translated into English using Google Translate.

with those who will participate in the event. Many people will be making calls for action, but a political entrepreneur is someone who has a following. Finally, organizations are real-world or online groups or societies that serve as the medium to help galvanize protestors. Political entrepreneurs can be the heads of such organizations, but not necessarily always. Unlike earlier times, when newspapers were the only data source for analysis, emerging social media such as Twitter provide great opportunities to comprehend public sentiments and opinions, identify the reasons behind, and track the evolution of, significant societal events. As a social network and simultaneously an information sharing medium [53], Twitter provides a real-time platform for spreading news and expressing opinions. Analyzing tweets related to specific protests gives useful insights into what causes such events, who the organizers are, and how online expression relates to, mimics, or even evolves into real-world events. To make such analytics through social media possible, several challenges must be addressed. The first problem pertains to identifying the tweets most relevant to a certain event from massive Twitter data. After obtaining event-related tweets, the more difficult task is analyzing them to infer the occurrence of trigger events, recognize the organizers, and understand the evolution process.

3.1.2 Event-related Tweet Extraction

To obtain tweets related to unrest events of interest, we developed a ranking algorithm to connect the dots between newspaper reports and tweets. Figure 3.2 shows the main components of our approach. We treat news articles and tweets as bags of words containing three kinds of terms: background terms, topic terms, and event terms. Background terms are phrases like stop words (such as and, if) that are typically discarded in analysis. Topic terms are usually the most frequently mentioned words when describing a certain topic (e.g., civil unrest). As can be seen from the word cloud in Figure 3.2, words such as "protest", "march", and "demand" are the most important topic words. An article frequently referring to these words is very likely to be related to civil unrest. Beginning with a database of 9,080 protest events in 10 countries of Latin America from Jan 2011 to Sep 2013, we first extracted the 200 top-ranked keywords based on the term-frequency-inverse-document-frequency (TF-IDF) statistic. The database of events was provided by MITRE Inc.; it summarizes news reports related to civil unrest from global news outlets (e.g., BBC, CNN) and major local media (e.g., Milenio, ABC Color). Event terms are those phrases that can help characterize a specific event. Unlike topic keywords such as "protest" and "march", which are widely mentioned in multiple protest articles, an article about the "dog protest" incident will have context-relevant words such as "dogs", "Iztapalapa", and "Cerro de la Estrella". These words are frequently mentioned in this event but rarely used in other civil unrest events, and are therefore capable of discriminating the "dog protest" from other protests.
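The TF-IDF keyword-ranking step can be sketched as follows. The toy corpus, tokenizer, and scoring details are illustrative assumptions (the chapter's actual pipeline ran over the MITRE event database), and in practice stop words would first be removed as background terms.

```python
import math
import re
from collections import Counter

def top_tfidf_keywords(documents, k=200):
    """Rank terms by accumulated TF-IDF weight across a corpus of articles."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of articles containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Accumulate each term's TF-IDF weight over all documents.
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            scores[term] += (count / total) * idf
    return [term for term, _ in scores.most_common(k)]

# Tiny illustrative corpus (not from the actual event database).
corpus = [
    "protesters march to demand release of dogs",
    "workers march and protest over pay demands",
    "the city opened a new park for dogs",
]
print(top_tfidf_keywords(corpus, k=5))
```

With the real database, the top of this ranking would contain the civil-unrest topic terms shown in the Figure 3.2 word cloud.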

[Figure 3.2 graphic: a word cloud of topic keywords feeds a relativity-ranking module together with event words, producing event-related tweets; the example tweets mention #YoSoyCan26 and the release of the Iztapalapa dogs, with event words and topic keywords highlighted.]

Figure 3.2: Event-related tweet extraction pipeline. Red boxes indicate event words for the street dog liberation protest in Mexico City and yellow boxes denote topic keywords. The word cloud in the top left shows top-ranked topic keywords for Mexico, generated from a database of 2,141 protest events in that country from January 2011 to September 2013. The tweets, originally in Spanish, have been translated into English using Google Translate.

Event-related tweets are retrieved from all the tweets published around the event date using queries comprising both event words and topic words. As Figure 3.2 shows, each retrieved tweet contains both event words (red boxes) and topic keywords (yellow boxes). The relevance scores of tweets to a certain event (represented by a collection of related news articles) are then quantitatively evaluated by the ranking algorithm considering textual, spatial, and temporal distances. The tweets are further clustered based on their content similarities and social ties, and the cluster with the largest average relevance score is returned as the set of event-related tweets.
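A minimal sketch of a relevance score combining textual, temporal, and spatial distances is shown below. The Jaccard and exponential-decay forms, the weights, and the field names are all illustrative assumptions; the chapter does not spell out the exact ranking formula here.

```python
import math
from datetime import date

def relevance_score(tweet, event, w_text=0.6, w_time=0.2, w_space=0.2):
    """Score a tweet against an event described by keywords, a date, and a location.
    Weights and decay constants are illustrative, not the chapter's actual values."""
    # Textual similarity: Jaccard overlap between tweet terms and event terms.
    t_terms = set(tweet["text"].lower().split())
    e_terms = set(event["terms"])
    textual = len(t_terms & e_terms) / len(t_terms | e_terms)
    # Temporal proximity: exponential decay with the gap in days.
    gap = abs((tweet["date"] - event["date"]).days)
    temporal = math.exp(-gap / 3.0)
    # Spatial proximity: decay with a (stub) distance in kilometers.
    spatial = math.exp(-tweet.get("km_to_event", 0.0) / 100.0)
    return w_text * textual + w_time * temporal + w_space * spatial

# Hypothetical tweet and event records.
tweet = {"text": "protest zocalo dogs release", "date": date(2013, 1, 8), "km_to_event": 2.0}
event = {"terms": {"protest", "dogs", "zocalo", "march"}, "date": date(2013, 1, 12)}
print(round(relevance_score(tweet, event), 3))
```

In the actual pipeline, scores like this would feed the clustering step, with the highest-scoring cluster returned as the event-related tweet set.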

3.1.3 Identifying Contributing Factors

Based on the strategy mentioned above, we extract tweets relevant to a certain event from the tweets published 10 days before and after the event occurrence date. To obtain an informative tweet set, we primarily consider tweets mentioning URLs that link to news reports. These tweets combine news and user opinions, which best reflects Twitter's hybrid nature as a social network and information medium. Figure 3.3 plots the distribution of event-related tweets for the "dog protest". This distribution differs from that of events related to other happenings (e.g., natural disasters [98]) and regular breaking news [37]. Prior work has found that for those types of events, the number of tweets spikes on the date of event occurrence and drops rapidly soon after. It can be observed from Figure 3.3 that the number of event-related tweets for a protest such as the one under consideration here begins to rise several days before the event date, as awareness of the issue grows and moves towards an unrest event; thus a burst in the number of tweets can occur before the actual event date. In this case, the "dog protest" happened on 12th Jan 2013, whereas the largest tweet burst occurred on 8th Jan, 4 days earlier. By analyzing such early bursts, we can identify the factors contributing to the burst, namely the key information (e.g., triggers, organizations, political entrepreneurs) of the protest we are looking for. Trigger Events: As shown in Figure 3.3, the trigger for this event was that the Public Security Secretariat of Mexico City captured several dogs on Monday (Jan 7th), one day before the burst of tweets.
As shown in their tweets (e.g., "the dogs are not murderers", "The 'authorities' blame dogs, while murderers are free."), Twitter users believed that the captured dogs were "innocent", that the real "murderers" were human beings, that the government had turned a blind eye to the truth, and, finally, that the expected euthanasia of the dogs would constitute the true homicide. Organizations: According to news reports as well as our observations of tweet contents, "#YoSoyCan26" was one of the most popular topics in Mexico on Jan 8th. Twitter users called for protest with this hashtag mentioned in their tweets. The hashtag was created by analogy to "#YoSoy132", which was used to call for protests against alleged electoral fraud after the Federal Electoral Institute delivered preliminary results pointing to Peña Nieto as President-Elect3.

Figure 3.3: Distribution of tweets related to the street dog liberation protest in Mexico City. Tweets spiked several days before the rally, which is typical of incidents of civil unrest. In contrast, tweets related to breaking news stories and major events such as natural disasters usually spike during the day of the event. The tweets, originally in Spanish, have been translated into English using Google Translate.

Political Entrepreneurs: On Jan 8th, as "#YoSoyCan26" became the trending topic, influential organizations such as AnimaNaturalis4 also participated in the event. As shown in a later section, they acted as "political entrepreneurs" in this protest, playing an important part in advertising and organizing the protests.

3.1.4 Event Evolution Analysis

In addition to the largest burst on Jan 8th, there were some smaller bursts and upward turning points before the event date of Jan 12th (red circles in Figure 3.3). To further explore the reasons for these increases, we looked into all the event-related tweets (including tweets without links) on these dates. As shown in Figure 3.4, we arrange tweets related to the "dog protest" along the orange timeline, and place news reports extracted from tweet links along the blue timeline. In the initial phase, news related to the "dogs of Iztapalapa" attracted little public attention. The first related news item, reporting that "bodies of a woman and a baby found in Iztapalapa were identified", appeared on Jan 4th. However, this report went relatively unremarked, since few related tweets were identified on that day. One day later, the government announced that two more people found dead in Iztapalapa had been killed by stray dogs. Compared with Jan 4th, people appeared to be more interested in this new discovery, with more related tweets occurring, but many of them were still just repetitions of news report headlines. The turning point came on Jan 7th, when the government captured stray dogs as suspects. Tweets doubting whether the criminals were dogs began spreading quickly among Twitter users. In addition, the first tweet with the hashtag "#yosoycan26" turned up shortly after the news report calling for the freedom of the dogs. Within one day, this hashtag became one of the most popular trending topics on Twitter. On Jan 8th, the related tweet volume experienced its biggest burst. Generally, there were two kinds of tweets. The first kind showed great sympathy for the dogs, such as "dogs of Iztapalapa are innocent". Beyond simple sympathy towards the captured dogs, more tweets showed strong dissatisfaction with the government: "A country, a civilization can be judged by the way its animals are treated", "There are other dogs more dangerous, some in government and others in the Private Sector".
These tweets attracted great attention from the public, so much so that even the news media reported "#yosoycan26" as the trending topic on Twitter in Mexico later that day. These news reports further helped spread "#yosoycan26" to broader audiences. After Jan 8th, the volume of relevant tweets began to decrease, because no emerging news providing additional information was reported after Jan 8th. From Jan 9th, "political entrepreneurs" accelerated the evolution of the online topic towards real-world action. One of the political entrepreneurs, the nonprofit organization AnimaNaturalis, posted tweets on Jan 9th calling for justice: "We demand justice for the dogs identified as causing four deaths" and later, on Jan 11th, calling

3http://goo.gl/vth5sN
4http://www.animanaturalis.org

[Figure 3.4 graphic: parallel news and Twitter timelines from Jan 4 to Jan 13. News milestones: bodies of a woman and a baby found in Iztapalapa (Jan 4); two more bodies found, killed by stray dogs (Jan 5); SSPDF captures a pack of dogs (Jan 7); users demand release of the dogs under #yosoycan26 (Jan 8); the Iztapalapa dogs protest (Jan 12). Example tweets, including AnimaNaturalis's calls for justice (Jan 9) and for the Zocalo concentration (Jan 11), accompany the timelines.]

Figure 3.4: Events leading up to the street dog liberation protest. The blue timeline indicates news reports, while the orange timeline denotes event-related tweets. On the blue line, triangles represent dates with emerging news, and squares are regular dates without emerging news. On the orange line, the size of the circle indicates the relative number of related tweets on the corresponding date. The original tweets, in Spanish, have been translated into English using Google Translate.

for planned protests: "TOMORROW / 11:00 a.m. Concentration Zocalo #Mexico City for dogs #Iztapalapa. #YoSoyCan26." Finally, the tweets calling for protests became real-world action. On Jan 12th, people held a protest demanding freedom for the 25 dogs, with placards reading "#yosoycan26".

3.1.5 Conclusion

The simple approach presented here helps reveal new insights into public events through social media like Twitter, and helps identify key information pertaining to protests and their evolution. Twitter has become a new medium for organizing protests, since organizers can share news, express views, and use viral social networking to reach out to thousands of people. Traditional media also play an important role, both as original sources and in helping spread trending topics from Twitter to the wider population. Our results also point the way for future work. One interesting avenue for gaining greater insights into civil unrest is to focus on political entrepreneurship. Models tapping into social networks can provide detailed information and dynamic processes with which to dissect the characteristics of successful and unsuccessful political entrepreneurs. The fact that Twitter is both a social network and simultaneously an information sharing medium permits analysis of the different means of communication that political entrepreneurs use, and of which appear to be most and least useful for them. Given the large N for civil unrest events, it will be possible to distinguish among types of political entrepreneurs as well as events.

3.2 Topical Analysis of Interactions Between News and Social Media

3.2.1 Introduction

Recently, online social media such as Twitter have emerged as major platforms for spreading information [53], and have served as tools for organizing and tracking social events [40]. Understanding the triggers and shifts in opinion-driven mass social media data can provide useful insights for various applications in academia, industry, and government [68, 111]. However, there remains a general lack of understanding of what causes hot spots in social media. Typically, the reasons behind the rapid spread of information can be summarized in terms of two categories: exogenous and endogenous factors [53, 59, 66]. Endogenous factors are the results of information diffusion inside the social network itself; that is, users obtain information primarily from their online social network. In contrast, exogenous factors mean that users get information from outside sources first, for example traditional news media, and then bring it into their social network. Although previous works have explored both social media and external news datasets, few researchers have looked at the endogenous and exogenous factors based on semantic or topical knowledge. They have either sought to identify relevant tweets based on news articles [38, 43], or simply correlated the two data sources through similar patterns in the changing data volume [109]. More sophisticated methods that are capable of modeling topics within a single data source as well as measuring topical shifts and influence across multiple related datasets are therefore desired. In fact, even within the same data source, there can be various factors that drive the evolution of information over time. For example, Leskovec et al. [60] point out two endogenous factors for news articles, "imitation" and "recency", which refer to information propagation and temporal variation factors, respectively. Similarly, themes within social media can maintain volume growth via a complex interplay with abundant internal historical data [53].
Exogenous factors across multiple datasets make analyzing the evolution of, and relationships among, multiple data streams more difficult [66]. To tackle these challenges, models proposed for this purpose must be able to capture the distinct features of different data sources as well as reveal the influential relationships between them. Monitoring social media and outside news data streams in a unified framework can be a practical way of solving this problem. In this paper, we investigate factors related to the endogenous and exogenous information diffusion processes based on semantic and topical knowledge. We propose a novel topic model, the News and Twitter Interaction Topic model (NTIT), that jointly learns social media topics and news topics and subtly captures the influences between topics. The intuition behind this approach is that before a user posts a message, he/she may be influenced either by opinions from his/her online friends or by news articles from news agencies. The asymmetrical graphical model we propose is particularly designed to capture such behaviors of social media users. In our new framework, a word in a tweet can be responsive to topical influences coming either from endogenous factors (tweets) or from exogenous factors (news). Figure 6.2 shows an example of our problem and goals. The example introduced here is a protest that happened in Mexico [40]. On January 7, the local government arrested 26 dogs as suspects in a murder case. Twitter users so angrily demanded the release of the animals that the hashtag "#yosoycan26" (I am dog 26) became a trending topic the following day, which finally resulted in a real-world protest on January 12. Using the new NTIT model, we attempt to address the following questions: 1) Do Twitter and news cover the same set of topics?
As can be seen from the figure, the two datasets share some common topics (e.g., topic "dog" and topic "yosoycan26"), but may also have some distinct topics of their own (e.g., topic "call for protest" only appears in the Twitter dataset). 2) For each topic, which came first, news or tweets? Topics may display different temporal patterns in different datasets. For example, at time t1 the topic "yosoycan26" experienced a burst in the Twitter data first, followed by a news burst on the same topic shortly afterwards at time t2. 3) As time goes by, how do topics affect each other? Intuitively, topic "yosoycan26" could be the trigger for topic "call for protest". With the outputs of the NTIT model, we can model such directional influence between topics quantitatively. 4) What are the key contributors (e.g., key documents or key players) pushing the evolution of the event? By utilizing the control variables of NTIT, we can identify key contributors to the event evolution, such as milestone news reports, hot tweets, and influential users. Our major contributions in this paper are summarized as follows:

Figure 3.5: An example of daily volume and topics on a particular theme in News data (top) vs Tweets data (bottom). Along the timeline (x-axis), the shaded areas represent the numeric values of raw document volume for news articles and tweets; the red and blue curves are hidden topics (e.g., "yosoycan26", "dog", "call_for_protest") discovered by our NTIT model.

• We propose a novel Bayesian model that jointly models the topics and interactions of multiple datasets. It is already known that knowledge learned from long articles (e.g., Wikipedia) can improve the learning of topics for short messages (e.g., tweets) [17, 85]. Our proposed model can easily transfer topical knowledge from news to tweets and improve the performance of both data sources.

• We provide an efficient Gibbs sampling inference method for the proposed NTIT model. Gibbs sampling was chosen for the inference and parameter estimation of the NTIT model for its high accuracy in estimation for LDA-like graphical models.

• We demonstrate the effectiveness of the proposed NTIT model compared to existing state-of-the-art algorithms. The NTIT model is tested on large-scale News-Twitter datasets associated with real-world events. With extensive quantitative and qualitative results, NTIT shows significant improvements over baseline methods.

• We explore real-world events by using our NTIT model to reveal interesting results. Our proposed model allows a variety of applications related to textual and temporal relationships. The learned estimates of hidden variables can be used for discoveries of various types, such as key documents, topic differences, and topical influences.

3.2.2 Related Work

To the best of our knowledge, this is the first attempt to study the interaction between news and social media through hidden topics. Previous related research has focused on topic modeling on short texts, transfer learning, and the processes involved in topic evolution.

Topic Modeling on Short Texts

Latent Dirichlet Allocation (LDA) [10] has achieved great success in mining hidden topics in documents. Recently, with the development of online social media, there has been increasing interest in mining short texts using topic models. Some existing work has looked at the problem of how to apply standard topic modeling approaches in social media environments. For example, Hong et al. [35] tested several schemes to train their LDA model on short messages, and concluded that document length has a significant impact on the performance of standard topic models. Yang et al. treated Twitter topic modeling as a multi-class multi-label classification problem [122], which can be solved using regularized logistic regression. Other previous work has applied variations of LDA to capture the latent topics in social media data. For example, Zhao et al. considered each tweet to be associated with only one topic [126], rather than a topic mixture. Vosecky et al. [112] extended LDA to include multiple facets that jointly model terms and entities (e.g., "person", "organization", and "location"). Ma et al. [69] took a different approach, utilizing hashtags and timestamps to guide the generative process of tweet content. Lin et al. [67] used a "Spike and Slab" prior to deal with the sparsity problem of short texts, which allows documents to choose particular topics of interest.

Transfer Knowledge in Multiple Datasets

Related work that jointly considers news and social media data sources has mainly focused on learning knowledge from long articles and then transferring this learned knowledge to retrieve relevant tweets. Hua et al. [39] proposed a semi-supervised approach to learn "general" and "specific" features from news articles and then retrieve related social media content according to the entities extracted from the news. By controlling parameters such as the Dirichlet prior and Bernoulli variables that shift between long and short texts, Jin et al. developed a model that jointly considers short text messages and long text data when mining particular topics [43]. Hu et al. proposed a variation of LDA to extract topics covered by tweets as well as split events into sequential segments [38]. Given a news article, Stajner et al. [104] attempted to select the most interesting subset of social media messages generated in response, while Tsagkias et al. [108] proposed a ranking method to retrieve the social media data most relevant to the article. However, none of these studies looked at the interaction between the news and social media data.

Mining Time Series and Topic Evolution

Some attempts have focused on modeling the time series of news and social media and mining their evolution processes. Leskovec et al. first studied news dynamics in social media based on three assumptions regarding the news sources: imitation, recency preference, and concurrency [60]. Hong et al. [36] developed this further by integrating the topic volume dynamics from [60] with topics shared by multiple text streams. Blei and Lafferty [9] added Gaussian noise to current topics for the generation of topics at the next time stamp; Wang and McCallum [116] also considered time in their study of topic generation to discover time-aware topics. Tsytsarau et al. [109] tried to address the problem of the interaction between news and social media by mining the hidden variables that control volume evolution. Unlike our topic-aware approach, however, they only conducted convolution and deconvolution on the volumes of social media and news data and were thus unable to capture the semantic and topical differences between heterogeneous datasets.

3.2.3 Problem statement and Model

In this section, we first introduce our tasks through an illustrated example in Section 3.2.4. We then move on to present our new NTIT model in Section 5.3, after which we provide its inference algorithm in Section 3.2.5. The important notations used in our paper are summarized in Table 5.1.

3.2.4 Problem Statement

Beyond the numeric features of raw document volume [109], the focus of this paper is to identify the underlying topics of the two data sources and explore their relationships. We begin with an example to illustrate our goals and ideas. The background bars in Figure 6.2 represent the input data, including a news document set (top) and a tweet set (bottom) on a particular theme, such as civil unrest.

Table 3.1: Mathematical Notation

Notation   Description
R          A set of news articles
T          A set of tweets
θr         topic mixture proportion for news article r
θt         topic mixture proportion for tweet t
Zr         mixture indicator choosing topics for words in news article r
Zt         mixture indicator choosing topics for words in tweet t
Wr         words in news set R
Wt         words in tweet set T
xt         document indicator for tweet words to choose topics
µx         indicator distribution for tweet words choosing the document to draw topics from
αr         Dirichlet parameter of the multinomial distribution θr
αt         Dirichlet parameter of the multinomial distribution θt
αx         Dirichlet parameter of the multinomial distribution µx
β          Dirichlet parameter for the mixture components

The lines in the figure denote the underlying topics found by the proposed model. Based on the hidden topics, solving the following problems is critical to understanding the interactions between news and social media data.

1. Which topics are owned by only one dataset, and which topics are more likely to be shared by both? As shown in Figure 6.2, the news and Twitter datasets can cover different topics. For example, topic 1 and topic 2 are common topics shared by both datasets, while topic 3 only appears in the Twitter data. The study of topic coverage is of great importance, as it can reveal the different focuses of the two data sources.

2. Much previous work assumes that Twitter can provide more timely information than traditional news. Is this assertion always true? One topic can display distinct temporal patterns in the two data sources. For instance, a Twitter burst of topic 2 occurred at time t1 and the corresponding news burst emerged at time t2. Temporal pattern study based on topics provides a chance to compare the timeliness of the two datasets at a finer granularity.

3. How can the influence between topics be evaluated? There may exist hidden correlations between heterogeneous topics. Topic 2 and topic 3 share many bursts, and topic 2 seems to run slightly ahead of topic 3. Influence study based on topic correlations can be useful for understanding the evolution process of complex events.
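One simple, illustrative way to quantify the lead-lag behavior described in questions 2 and 3 is lagged Pearson correlation between two topic time series. This is only a sketch of the idea with toy data, not the influence measure the NTIT model itself provides.

```python
def pearson(x, y):
    """Pearson correlation coefficient (stdlib-free)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag]; positive lag means x leads y."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return pearson(x, y)

def best_lag(x, y, max_lag=5):
    """Lag at which x best aligns with y."""
    return max(range(-max_lag, max_lag + 1), key=lambda l: lagged_corr(x, y, l))

# Toy topic volumes: "yosoycan26" (a) bursts two steps before "call_for_protest" (b).
a = [0, 1, 5, 9, 4, 2, 1, 0, 0, 0]
b = [0, 0, 0, 1, 5, 9, 4, 2, 1, 0]
print(best_lag(a, b))  # → 2: topic a leads topic b by two time steps
```

A positive best lag for the "yosoycan26" series against the "call_for_protest" series would match the intuition that the former triggered the latter.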

Based on the above observations from real data, the tasks of this paper can be described as follows. Problem: Given a news document set and a tweet set, the tasks of this paper are to: 1) identify a list of underlying topics for both datasets; 2) characterize the differences in topic coverage between news and Twitter; 3) measure temporal differences between the two datasets based on topics; 4) calculate topic correlations and model the evolution process.

Model

As shown in Figure 5.2, NTIT jointly models news topics and Twitter topics under an asymmetrical framework. Specifically, news articles and tweets are connected by a multinomial variable Xt that controls the influence from news to tweets. An LDA-like generative process has been chosen for topic modeling in the news documents (left panel of Figure 5.2). The generative process is described in Algorithm 2. For each document mr, a multinomial distribution θr is first sampled from a Dirichlet with prior parameter αr. For each word i in document mr, a topic assignment z is first chosen from Mult(θr), and the word is then generated from the topic-specific multinomial distribution ϕz. For the tweets (right panel of Figure 5.2), to model the behaviors of users, we assume that tweets consist of words either sampled from topics originally taken from news documents or from topics generated by the social media network itself. Traditional topic methods such as LDA are known to perform poorly in short and noisy environments like Twitter [38]. Phan et al. [86] pointed out

[Figure 3.6 graphic: plate diagram connecting the news side (αr, θr, zr, wr over plates Mr and Nr) and the tweet side (αt, θt, zt, xt, wt, µx over plates Mt and Nt), with shared topic-word distributions Φ drawn from prior β and indicator prior αx.]

Figure 3.6: NTIT graphical model

that learning the hidden topics from long articles such as news reports or blogs can help improve the performance of topic modeling on short texts such as tweets. However, directly applying a topic model trained on long texts to short messages will eliminate their distinct features, such as hashtags, mentions, and user comments. Considering the diversity of tweets and their noisy nature, our proposed NTIT learns topics for tweets flexibly. As with words in news documents, each tweet word is generated from a distribution over topics. However, words in tweets can be sampled either from a mixture of news topics or from a mixture of tweet topics. A multinomial variable Xt controls the choice of these mixtures for tweet words. If the sampled result of xt is a document mr from the news set, the tweet word will draw its topic assignment from the mixture θr of document mr. Otherwise, if xt indicates that tweet topics have been selected, the tweet word will be generated from Mult(θt). The benefits of our proposed NTIT can be summarized as follows.

1. Easy to identify common topics. In our NTIT, a common topic-term distribution φ is shared by both tweets and news documents, which facilitates the identification of common topics. Meanwhile, the topic variations in different datasets can be easily calculated based on their word frequency weights.

2. Easy to retain distinct features. By utilizing the control variable Xt, tweets are able to learn enriched topics from the knowledge contained in long news articles while preserving their distinct features. Meanwhile, unlike symmetrical topic models [43], NTIT is an asymmetrical model that prevents the errors and noise of tweets from impacting the modeling of news documents. αx can be adjusted to prefer key documents.

3. Capable of measuring topic influence. Through the indicator Xt, the NTIT model can easily tell whether a tweet word is generated from news topics or tweet topics. This control variable links the topic-term distribution and the document-topic mixture, providing a way to evaluate topic-level influence. The detailed methodology is discussed in Section 3.2.6.

3.2.5 Inference via Gibbs Sampling

Although exact inference of the posterior distributions for the hidden variables in the NTIT model is generally intractable, the solution can be estimated through approximate inference algorithms such as mean-field variational inference [10] and Gibbs sampling [27, 29]. We chose Gibbs sampling for the inference of the proposed NTIT model, as this approach can yield unbiased estimates for LDA-like graphical models [117]. Based on the generative process illustrated in Algorithm 1 and the graphical model in Figure 3.6, the joint distribution of the NTIT model can be represented as Equation (3.1).

ALGORITHM 1: Generative process of the NTIT model
Input: news words Wr, tweet words Wt, hyperparameters αr, αt, αx, and β, topic number K
Output: news topic assignments Zr, tweet topic assignments Zt, indicator Xt, multinomial parameters θr, θt, µx, and φ

Initialization;
for each topic k ∈ [1, K] do
    draw mixture component φ_k ∼ Dir(β);
for each news document m_r ∈ Mr do
    draw topic proportions θ_mr ∼ Dir(αr);
    for each word w_r,i in news document m_r do
        draw topic index z_mr,i ∼ Mult(θ_mr);
        draw word w_r,i ∼ Mult(φ_{z_mr,i});
for each tweet m_t ∈ Mt do
    for each word w_t,i in tweet m_t do
        draw indicator x_t,w ∼ Mult(µx);
        if x_t,w ∈ R then
            draw topic index z_mt,i ∼ Mult(θ_r^(x_t));
        if x_t,w ∈ T then
            draw topic proportions θ_t ∼ Dir(αt);
            draw topic index z_mt,i ∼ Mult(θ_t^(x_t));
        draw word w_t,i ∼ Mult(φ_{z_mt,i});
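To make the generative process concrete, the following is a minimal Python simulation of it. All dimensions, document lengths, and hyperparameter values here are illustrative toy choices, not settings from our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 5, 50          # number of topics, vocabulary size (toy values)
M_r, M_t = 10, 30     # number of news documents and tweets (toy values)
beta = np.full(V, 0.01)
alpha_r = np.full(K, 0.1)
alpha_t = np.full(K, 0.1)
alpha_x = np.full(M_r + M_t, 0.1)  # prior over "whose topic mixture to use"

phi = rng.dirichlet(beta, size=K)           # shared topic-word distributions
theta_r = rng.dirichlet(alpha_r, size=M_r)  # news document-topic mixtures

# News words: z ~ Mult(theta_r[m]), w ~ Mult(phi[z])
news = [[int(rng.choice(V, p=phi[rng.choice(K, p=theta_r[m])]))
         for _ in range(40)] for m in range(M_r)]

mu_x = rng.dirichlet(alpha_x)               # choice distribution for tweet words
tweets = []
for m in range(M_t):
    theta_t = rng.dirichlet(alpha_t)        # the tweet's own topic mixture
    words = []
    for _ in range(8):
        x = rng.choice(M_r + M_t, p=mu_x)   # indicator x_t: news doc or tweet topics
        mix = theta_r[x] if x < M_r else theta_t
        z = rng.choice(K, p=mix)
        words.append(int(rng.choice(V, p=phi[z])))
    tweets.append(words)
```

The asymmetry is visible in the sampling of `mix`: a tweet word may borrow a news document's topic mixture, but news words never touch tweet mixtures.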

$$
\begin{aligned}
P(Z_r, Z_t, X_t, W_r, W_t \mid \alpha_r, \alpha_t, \alpha_x, \beta)
&= \int P(W_r \mid Z_r, \Phi)\, P(W_t \mid Z_t, \Phi)\, P(\Phi \mid \beta)\, d\Phi \\
&\quad \cdot \int P(Z_r \mid \theta_r)\, P(Z_t \mid \theta_r, X_t \in R)\, P(\theta_r \mid \alpha_r)\, d\theta_r \\
&\quad \cdot \int P(Z_t \mid \theta_t, X_t \in T)\, P(\theta_t \mid \alpha_t)\, d\theta_t \\
&\quad \cdot \int P(X_t \mid \mu_x)\, P(\mu_x \mid \alpha_x)\, d\mu_x
\end{aligned} \tag{3.1}
$$

The key to this inference problem is to estimate the posterior distributions of the following hidden variables: (1) the topic assignment indicator Zr for words in news articles, which can be used to infer the document-topic mixture θr; (2) the topic assignment indicator Zt and topic mixture proportion θt, which play the same role for tweets; and (3) the topic indicator Xt for words in tweets and its conjugate prior µx.

From the joint distribution, the full conditional distribution for a word term i = (m, n) can be derived, where i denotes the nth word in document m. As a special case of Markov chain Monte Carlo, Gibbs sampling iteratively samples one instance at a time, conditional on the values of the remaining variables. Therefore, taking the inference of Zr as an example, the Gibbs sampler estimates P(z_r,i = k | Z_r,¬i, Zt, Xt, Wt, Wr) rather than the original probability P(Zr, Zt, Xt, Wr, Wt | αr, αt, αx, β). After cancelling the factors that are independent of z_r,i, the posterior is obtained in Equation (3.2). We only present the result here; the detailed derivation is omitted due to space limitations.

$$
P(z_{r,i} = k \mid Z_{r,\neg i}, Z_t, X_t, W_t, W_r)
= \frac{n^k_{r,w,\neg i} + n^k_{t,w} + \beta_w}{\sum_{w=1}^{V} \left( n^k_{r,w,\neg i} + n^k_{t,w} + \beta_w \right)}
\cdot \frac{n^k_{r,m_r,\neg i} + n^k_{t,m_r} + \alpha_{r,k}}{\sum_{k=1}^{K} \left( n^k_{r,m_r,\neg i} + n^k_{t,m_r} + \alpha_{r,k} \right)} \tag{3.2}
$$

In the above equation, V is the size of the vocabulary, n^k_{t,w} is the number of times topic k is assigned to word w_t in tweet set T, and n^k_{r,w,¬i} is the number of times topic k is assigned to word w_r in news article set R, excluding the current instance i and its topic assignment. Similarly, n^k_{r,mr,¬i} denotes the number of times topic k is assigned to words in news document m_r, except for instance i, and n^k_{t,mr} is the number of times topic k appears in tweet words that are generated from the topic mixture proportion θ_r of document m_r.
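A sketch of this sampling step in Python, assuming the sampler maintains the counts defined around Equation (3.2) as NumPy arrays with the current instance already decremented; the array names and the helper signature are our own illustrative choices:

```python
import numpy as np

def sample_z_r(w, m_counts, n_rw, n_tw, beta, alpha_r):
    """One collapsed-Gibbs draw of z_{r,i} for a news word w (cf. Eq. 3.2).

    n_rw, n_tw : (K, V) topic-word counts for news and tweets,
                 with the current instance i already removed.
    m_counts   : (K,) topic counts for the current news document m_r
                 (news words plus tweet words that chose m_r's mixture),
                 also excluding the current instance.
    """
    # Word factor: (n^k_{r,w,-i} + n^k_{t,w} + beta_w) / sum over vocabulary
    word_part = (n_rw[:, w] + n_tw[:, w] + beta[w]) / \
                (n_rw + n_tw + beta).sum(axis=1)
    # Document factor: (n^k_{r,m_r,-i} + n^k_{t,m_r} + alpha_{r,k}) / sum over topics
    doc_part = (m_counts + alpha_r) / (m_counts + alpha_r).sum()
    p = word_part * doc_part
    p /= p.sum()                      # normalize the unnormalized posterior
    return np.random.choice(len(p), p=p)
```

In a full sampler this draw is wrapped in a loop that first decrements the counts for instance i, samples a new topic, and then increments the counts again.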

The inference of Zt is slightly different from that of Zr, since words in tweets can be drawn from either a news document m_r or a tweet message m_t. Therefore, the conditional probability P(z_t,i = k | Zr, Z_t,¬i, Xt, Wt, Wr) is calculated in two cases, determined by the topic-choice indicator Xt.

When Xt ∈ R, the word topic assignment is drawn from a multinomial distribution with the topic mixture θ_r of a news document m_r:

$$
P(z_{t,i} = k \mid Z_r, Z_{t,\neg i}, X_t, W_t, W_r)
= \frac{n^k_{r,w} + n^k_{t,w,\neg i} + \beta_w}{\sum_{w=1}^{V} \left( n^k_{r,w} + n^k_{t,w,\neg i} + \beta_w \right)}
\cdot \frac{n^k_{r,m_r} + n^k_{t,m_r,\neg i} + \alpha_{r,k}}{\sum_{k=1}^{K} \left( n^k_{r,m_r} + n^k_{t,m_r,\neg i} + \alpha_{r,k} \right)} \tag{3.3}
$$

where n^k_{t,w,¬i} is the number of times topic k is assigned to word w_t in tweet set T, excluding word i and its topic assignment, and n^k_{t,mr,¬i} is the number of times topic k appears in tweet words generated from the topic mixture proportion θ_r of document m_r, excluding the current instance i.

When Xt ∈ T, the word topic assignment is drawn from a multinomial distribution with the topic mixture θ_t of a tweet message m_t:

$$
P(z_{t,i} = k \mid Z_r, Z_{t,\neg i}, X_t, W_t, W_r)
= \frac{n^k_{r,w} + n^k_{t,w,\neg i} + \beta_w}{\sum_{w=1}^{V} \left( n^k_{r,w} + n^k_{t,w,\neg i} + \beta_w \right)}
\cdot \frac{n^k_{t,m_t,\neg i} + \alpha_{t,k}}{\sum_{k=1}^{K} \left( n^k_{t,m_t,\neg i} + \alpha_{t,k} \right)} \tag{3.4}
$$

In Equation (3.4), n^k_{t,mt,¬i} is the number of times topic k appears in the words of tweet m_t generated from its topic mixture proportion θ_t, except for instance i; the other notation is as in Equations (3.2) and (3.3).

As can be seen from Algorithm 1, Xt is a control variable that determines whether a tweet word is sampled from a tweet message m_t or a news document m_r. To facilitate the inference, the Dirichlet distribution is chosen as the conjugate prior for Xt. As with Zt, the posterior of Xt is derived for two cases. When Xt ∈ R, we have:

$$
P(x_{t,i} = u \mid Z_r, Z_t, X_{t,\neg i}, W_t, W_r)
= \frac{n^k_{r,m_r} + n^k_{t,m_r,\neg i} + \alpha_{r,k}}{\sum_{k=1}^{K} \left( n^k_{r,m_r} + n^k_{t,m_r,\neg i} + \alpha_{r,k} \right)}
\cdot \frac{n^u_{x_t \in R,\neg i} + \alpha_{x,u}}{\sum_{u=1}^{M_r + M_t} \left( n^u_{x_t \in R,\neg i} + \alpha_{x,u} \right)} \tag{3.5}
$$

where n^u_{xt∈R,¬i} is the number of tweet words (excluding word i) that choose the topic mixture proportion of news document u.

For words with Xt ∈ T, tweet messages are chosen as the topic mixture proportions.

$$
P(x_{t,i} = u \mid Z_r, Z_t, X_{t,\neg i}, W_t, W_r)
= \frac{n^k_{t,m_t,\neg i} + \alpha_{t,k}}{\sum_{k=1}^{K} \left( n^k_{t,m_t,\neg i} + \alpha_{t,k} \right)}
\cdot \frac{n^u_{x_t \in T,\neg i} + \alpha_{x,u}}{\sum_{u=1}^{M_r + M_t} \left( n^u_{x_t \in T,\neg i} + \alpha_{x,u} \right)} \tag{3.6}
$$

In Equation (3.6), n^u_{xt∈T,¬i} denotes the number of times tweet document u is chosen as the topic mixture proportion for tweet words (excluding word i).

Finally, we need to obtain the multinomial parameters Φ = {φ_k}_{k=1}^K, Θr = {θ_{r,m}}_{m=1}^{Mr}, Θt = {θ_{t,m}}_{m=1}^{Mt}, and µx = {µ_u}_{u=1}^{Mr+Mt}. According to Bayes' rule and the definition of the Dirichlet prior, these multinomial parameters can be computed from the above posteriors:

$$
\phi_{k,w} = \frac{n^k_{r,w} + n^k_{t,w} + \beta_w}{\sum_{w=1}^{V} \left( n^k_{r,w} + n^k_{t,w} + \beta_w \right)} \tag{3.7}
$$

$$
\theta_{r,m,k} = \frac{n^k_{r,m_r} + n^k_{t,m_r} + \alpha_{r,k}}{\sum_{k=1}^{K} \left( n^k_{r,m_r} + n^k_{t,m_r} + \alpha_{r,k} \right)} \tag{3.8}
$$

$$
\theta_{t,m,k} = \frac{n^k_{t,m_t} + \alpha_{t,k}}{\sum_{k=1}^{K} \left( n^k_{t,m_t} + \alpha_{t,k} \right)} \tag{3.9}
$$

$$
\mu_u = \frac{n^u_x + \alpha_{x,u}}{\sum_{u=1}^{M_r + M_t} \left( n^u_x + \alpha_{x,u} \right)} \tag{3.10}
$$
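Equations (3.7)-(3.10) can be read off the final Gibbs counts directly. A sketch with NumPy, where the array names and shapes (noted in the docstring) are our own illustrative layout:

```python
import numpy as np

def estimate_parameters(n_rw, n_tw, n_rm, n_tm_r, n_tm_t, n_x,
                        beta, alpha_r, alpha_t, alpha_x):
    """Point estimates of phi, theta_r, theta_t, mu from final Gibbs counts.

    n_rw, n_tw   : (K, V) topic-word counts from news / tweets   -> phi     (Eq. 3.7)
    n_rm, n_tm_r : (M_r, K) topic counts per news document       -> theta_r (Eq. 3.8)
    n_tm_t       : (M_t, K) topic counts per tweet               -> theta_t (Eq. 3.9)
    n_x          : (M_r + M_t,) times each document was chosen
                   by the indicator x_t                          -> mu      (Eq. 3.10)
    """
    phi = n_rw + n_tw + beta                 # news and tweets share phi
    phi /= phi.sum(axis=1, keepdims=True)
    theta_r = n_rm + n_tm_r + alpha_r        # news mixtures also count tweet words
    theta_r /= theta_r.sum(axis=1, keepdims=True)
    theta_t = n_tm_t + alpha_t               # tweet mixtures use tweet counts only
    theta_t /= theta_t.sum(axis=1, keepdims=True)
    mu = (n_x + alpha_x) / (n_x + alpha_x).sum()
    return phi, theta_r, theta_t, mu
```

Note how the asymmetry of the model reappears here: the news estimates (3.7)-(3.8) pool counts from both datasets, while θt in (3.9) depends on tweet counts alone.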

3.2.6 Discovering Topic Lags and Influence

The outputs of the NTIT model can be used as the basis for further discoveries, such as topic distribution differences, topic temporal patterns, topic influence, and key news documents or tweets.

Topic distribution differences

As mentioned earlier, news and tweet documents share the same topic-term distributions Φ = {φ_k}_{k=1}^K, which facilitates identifying the common topics. Meanwhile, the difference in topic distribution between the two datasets can be evaluated by integrating their respective word distributions. Therefore, the distinct topic-term distribution D_r,k of news documents and D_t,k of tweets can be calculated as follows.

$$
D_{r,k} = \sum_{m_r=1}^{M_r} \sum_{w_r=1}^{N_{m_r}} \phi_{k,w} \cdot n^k_{m_r,w_r} \tag{3.11}
$$

$$
D_{t,k} = \sum_{m_t=1}^{M_t} \sum_{w_t=1}^{N_{m_t}} \phi_{k,w} \cdot n^k_{m_t,w_t} \tag{3.12}
$$

In Equations (3.11) and (3.12), m_r and m_t denote a specific news or tweet document, M_r and M_t are the total numbers of news documents and tweets, φ_k,w is the probability of word w in topic k, and n^k_{mr,wr} and n^k_{mt,wt} are the counts of a specific word assigned to topic k in news document m_r or tweet m_t, respectively.
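Since the inner sums of Equations (3.11) and (3.12) only involve per-topic word counts, they can be sketched compactly once topic assignments are aggregated over documents; the aggregation step itself is omitted and the helper names are illustrative:

```python
import numpy as np

def distinct_topic_mass(phi, n_kw):
    """D_k = sum_w phi[k, w] * n_kw[k, w]  (cf. Eqs. 3.11 / 3.12).

    phi  : (K, V) shared topic-word distributions
    n_kw : (K, V) counts of words assigned to topic k in one dataset
           (news for D_{r,k}, tweets for D_{t,k}), summed over documents.
    """
    return (phi * n_kw).sum(axis=1)

def topic_ratios(phi, n_kw_news, n_kw_tweets):
    """Normalized per-topic shares, as reported later ("Twitter %" / "News %")."""
    d_r = distinct_topic_mass(phi, n_kw_news)
    d_t = distinct_topic_mass(phi, n_kw_tweets)
    total = d_r + d_t
    return d_t / total, d_r / total   # (Twitter share, News share) per topic
```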

Topic temporal patterns

To evaluate the temporal patterns of topics, we construct topic-term time series by splitting the topic-term distributions (Equations (3.11) and (3.12)) with a daily sliding window. Taking the news data as an example, T_r,k represents the topic-term time series, and each element in the time series is a topic-term distribution at time τ, denoted D_r,k(τ). Instead of integrating over all news documents M_r as in Equation (3.11), D_r,k(τ) only considers news documents with timestamp t_mr equal to τ. The Twitter topic-term time series T_t,k is calculated in the same way.

$$
T_{r,k} = \{ D_{r,k}(\tau) : \tau \in \mathcal{T} \} \tag{3.13}
$$

$$
D_{r,k}(\tau) = \sum_{t_{m_r} = \tau} \sum_{w_r=1}^{N_{m_r}} \phi_{k,w} \cdot n^k_{m_r,w_r} \tag{3.14}
$$
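The per-day restriction in Equation (3.14) amounts to bucketing documents by timestamp. A sketch, where the (day, counts) data layout is an assumption for illustration:

```python
from collections import defaultdict

def topic_time_series(docs, phi_k):
    """Daily topic-term series for one topic k (cf. Eqs. 3.13-3.14).

    docs  : iterable of (day, counts) pairs, one per document, where
            `counts` maps word id -> number of times that word in the
            document was assigned to topic k
    phi_k : mapping word id -> phi[k, w]
    Returns {day: D_k(day)}, summing only over documents whose
    timestamp equals that day.
    """
    series = defaultdict(float)
    for day, counts in docs:
        series[day] += sum(phi_k[w] * c for w, c in counts.items())
    return dict(series)
```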

Topic influence

In the NTIT model, topics are multinomial distributions over words φ_k,w, the topic-document indicator χ_w,u denotes the number of times document u is chosen by word w during generation, and θ_u,k gives the probability of topic k appearing in document u. By combining these three variables and marginalizing φ_k,w over words, the probability of topic k_j being influenced by topic k_i can be evaluated through Equation (3.15).

$$
p(k_i \rightarrow k_j) = \sum_{w \in T,\, u \in D_R \cup D_T} \phi_{k_i,w} \cdot \chi_{w,u} \cdot \theta_{u,k_j} \tag{3.15}
$$

Equation (3.15) provides a method to quantify the directional influence between any two topics, from which we can easily explain whether a topic k_j evolved from topic k_i.
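Because Equation (3.15) sums over all words and documents, it reduces to a chain of matrix products once φ, χ, and θ are stored as arrays. A sketch under that assumption:

```python
import numpy as np

def topic_influence(phi, chi, theta):
    """All pairwise p(k_i -> k_j) values (cf. Eq. 3.15).

    phi   : (K, V) topic-word distributions
    chi   : (V, U) chi[w, u] = number of times document u was chosen
            by word w during generation (from the indicator X_t)
    theta : (U, K) document-topic mixtures
    Returns a (K, K) matrix whose (i, j) entry is p(k_i -> k_j);
    the matrix products sum over words w and documents u.
    """
    return phi @ chi @ theta
```

Thresholding this matrix (e.g., keeping entries above 0.15 after normalization) yields the influence graph discussed in the experiments.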

Key news reports and tweets

The topic-document indicator u = X_t,w represents that document u is chosen as the topic mixture prior when generating tweet word w. For each document u, the count of tweet words with X_t,w = u can therefore serve as its importance I_u:

$$
I_u = \sum_{w \in T} \mathbb{1}\left[ X_{t,w} = u \right] \tag{3.16}
$$

The more important a document u is, the more words refer to it as their topic mixture, yielding a larger I_u. The top-ranked news reports and tweets are treated as key documents that dominate the topics.
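Equation (3.16) is a simple count over the sampled indicator values; a sketch (helper names are illustrative):

```python
from collections import Counter

def document_importance(x_assignments):
    """I_u = number of tweet words whose indicator X_{t,w} chose document u
    (cf. Eq. 3.16). `x_assignments` holds one sampled x_t value per tweet word."""
    return Counter(x_assignments)

def top_key_documents(x_assignments, n=5):
    """Key documents: the n documents with the largest importance counts."""
    return document_importance(x_assignments).most_common(n)
```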

3.2.7 Experiment

In this section, we first describe our evaluation datasets, and then compare our proposed NTIT model with existing state-of-the-art algorithms. Finally, extensive discovery results are presented by exploring the output of NTIT.

Dataset

To construct our evaluation datasets, we crawled publicly accessible data using an RSS API and the Twitter API¹. Our datasets consist of two parts: the News Dataset and the Tweets Dataset.

News Dataset. In our study, we focus on influential civil events in Latin America. Events in this domain were chosen due to their great social influence and high evolution complexity. An event is considered "influential" if it is reported by all of the top local news outlets (shown in Table 3.2). News reports corresponding to each event are downloaded as data for the News dataset.

Table 3.2: Distribution of events and tweets across 5 Latin countries. “News source” indicates the news agencies utilized as sources for News dataset.

Country | News source² | #Events | #Tweets
Argentina | Clarín; La Nación; Infobae | 9 | 67,365
Brazil | O Globo; O Estado de São Paulo; Jornal do Brasil | 11 | 338,017
Colombia | El Espectador; El Tiempo; El Colombiano | 7 | 60,578
Mexico | La Jornada; Reforma; Milenio | 30 | 576,392
Venezuela | El Universal; El Nacional; Últimas Notícias | 17 | 224,196

Tweet Dataset. The tweets used for the experiments in this paper are collected via the following steps.

1. Select keywords from the title and abstract of a news report.

¹ https://dev.twitter.com/rest/public
² In addition to the top 3 domestic news outlets, the following global news outlets are included: The New York Times, The Guardian, The Wall Street Journal, The Washington Post, The International Herald Tribune, The Times of London, and Infolatam.

2. Retrieve relevant tweets using the keywords identified in Step 1.
3. Manually check the tweets from Step 2 to confirm whether they are indeed relevant to the given events.
4. In the truly relevant tweets, identify the hashtags specifically correlated to the given events.
5. Retrieve Twitter data again through the hashtags identified in Step 4.
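Steps 2 and 4 above can be sketched as simple keyword and hashtag filters. This is a deliberate simplification of the actual retrieval pipeline (which queried the Twitter API), and the function names are illustrative:

```python
def relevant_tweets(tweets, keywords):
    """Step 2 (simplified): keep tweets matching any keyword from Step 1,
    using case-insensitive substring matching."""
    kws = [k.lower() for k in keywords]
    return [t for t in tweets if any(k in t.lower() for k in kws)]

def event_hashtags(tweets):
    """Step 4 (simplified): collect hashtags appearing in the (manually
    confirmed) relevant tweets, to seed the second retrieval round."""
    return {tok for t in tweets for tok in t.split() if tok.startswith("#")}
```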

Table 3.2 shows the statistics for our datasets. In total, we selected the 74 most influential events that occurred in 5 Latin American countries in the period from January 2013 to December 2013, covering 1,266,548 tweets and 132,756 news reports. Standard NLP preprocessing was performed on both the news and Twitter datasets, including stop word removal, POS tagging, and phrasing³. There are an average of 25.2 words per tweet message and 304.7 words per news article.

Results of modeling performance

Perplexity. Perplexity is a standard metric used to evaluate a topic model's ability to fit the data, and is typically defined as follows:

$$
\mathrm{Perplexity}(D \mid \mathcal{M}) = \exp\left\{ -\frac{\sum_{d \in D} \log P(w_d \mid \mathcal{M})}{\sum_{d \in D} N_d} \right\}
$$

where M is the model learned from the training dataset, w_d is the word vector for document d, and N_d is the number of words in d. A lower perplexity indicates more accurate performance of the model. Few previous works have jointly modelled news and Twitter datasets; in this paper, we have chosen standard LDA [10], Gamma-DLDA [43] (also a joint model of short and long texts), and ET-LDA as baselines for comparison. Figure 3.7 presents the perplexity comparison for the models on both the news and Twitter datasets. Generally, perplexity decreases as K increases, which indicates that a larger K (number of topics) can better explain the textual data. Gamma-DLDA returns high perplexity values, LDA and ET-LDA achieve intermediate performance, and our model exhibits the lowest perplexity on both news and tweets. The poor performance of Gamma-DLDA is due to its completely symmetrical structure: long articles are known to help improve the modelling performance on short messages [86], but a symmetrical structure propagates errors and noise from short texts to long texts. Unlike Gamma-DLDA, our NTIT model is asymmetrical in structure, which improves Twitter modelling performance through knowledge learned from news while suppressing the negative impact from Twitter on news. ET-LDA is also an asymmetrical model and therefore achieves the second-best performance on tweets; however, tweet words in ET-LDA can only be generated from news topics or background topics, excluding the key tweet topics that are considered in NTIT. LDA is a traditional model for topic analysis, but achieves non-trivial performance on both news and Twitter.

³ NLP tools are from http://www.basistech.com/
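Given per-document held-out log-likelihoods, the perplexity defined above is a one-line computation; a sketch:

```python
import math

def perplexity(log_likelihoods, doc_lengths):
    """Perplexity of a held-out set from per-document log P(w_d | M)
    values and word counts N_d, as defined above. Lower is better."""
    return math.exp(-sum(log_likelihoods) / sum(doc_lengths))
```

For example, a model assigning each of 4 words probability 0.5 yields a perplexity of exactly 2.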

Figure 3.7: Perplexity comparison for the News and Tweets datasets. (a) Perplexity of News; (b) Perplexity of Tweets.

Semantics. In this part, we evaluate the NTIT model against the baseline method LDA in terms of semantic meaning. Table 3.3 presents the top words of 3 selected topics discovered from the theme "teacher protests" in Mexico. For better interpretation, the listed topics are manually assigned meaningful labels. "CNTE", "SNTE", and "CETEG" are three teacher organizations in Mexico, which emerged as important topics in different periods throughout the series of protests. As can be seen from Table 3.3, NTIT and LDA display similar performance on the news dataset, but yield quite different results for the Twitter dataset. This leads to several interesting observations:

1. Each topic from NTIT can be easily correlated to its corresponding label, since the representative hashtag is ranked highly. For example, in the topic "CETEG" for the NTIT model, the hashtag "#CETEG" is the top-ranked word and most of the remaining words are directly related to the label. In contrast, it is hard to distinguish topics in LDA: (i) both "#snte" and "#cnte" appear in the topic "CNTE"; (ii) topics share too many common words, such as "gobierno" (government) and "reforma" (reform), which indicates that the LDA model tends to output unclear topic mixtures.

2. Most words identified as NTIT topics are related to the label, such as “marcha”(march) and “maestro”(teacher). But LDA seems to produce more meaningless background words, such as “televisar”(television), “#Mexico”, and “#foto”(photo).

3. It is also clear that tweet topics from NTIT retain more distinct Twitter features than LDA. In addition to the key word "#ceteg", the NTIT "CETEG" topic contains event-specific hashtags such as "#FebreroMesDeLaCruzada". Similar examples can also be found in the other two

Table 3.3: Top words of top topics of NTIT and LDA

Methods | Topics | Words
NTIT on news | CETEG | ceteg, guerrero, azteca, autopista, acceso, partido
NTIT on news | CNTE | cnte, maestro, veracruz, querer, oaxaca, reforma, coordinador, marcha
NTIT on news | SNTE | snte, maestro, nacional, trabajador, sindicato, federal, peso
NTIT on Tweets | CETEG | #ceteg, deber, guerrero, educativo, clase, lucha, #FebreroMesDeLaCruzada
NTIT on Tweets | CNTE | #cnte, maestro, quincena, marcha, robar, #oaxaca, gabinocue, lana
NTIT on Tweets | SNTE | reforma, #snte, elba, educativo, #educacion, arresto, nacional, gobierno, deber
LDA on News | CETEG | encontrar, querer, llegar, deber, tiempo, presidente
LDA on News | CNTE | cnte, maestro, reforma, gobierno, ciudad, educativo, nacional
LDA on News | SNTE | snte, tomar, encontrar, acuerdo, clase
LDA on Tweets | CETEG | #ceteg, televisar, gobierno, #mexico, reforma, #foto, apoyar, pedir, educativo
LDA on Tweets | CNTE | #cnte, maestro, #snte, reforma, marcha, educativo, gobierno, derecho
LDA on Tweets | SNTE | maestro, #snte, reforma, #foto, educativo, gobierno, nacional, pedir

topics “CNTE” and “SNTE”. This result demonstrates that the NTIT model is able to prevent short texts from being “submerged” by long text topics.

Results of topic evolution discovery

Topic distributions and influence. Do news outlets and Twitter cover the same topics? To explore this question, we calculate the topic-term distributions using Equations (3.11) and (3.12); the normalized results are shown in the first two columns of Table 3.4. The results clearly show that topics are distributed quite differently in Twitter and the news. Topics 5, 7, 11, 12, and 19 are tweet-dominant topics that mainly appear in the Twitter data (red rows), while topics 2, 3, 9, 10, 13, 16, 17, and 18 are news-dominant topics that are more likely to exist in the news data (green rows); the remaining topics are common topics that are almost evenly distributed across Twitter and the news (yellow rows). To further explore the relationships between topics, we apply Equation (3.15) to calculate the topic influence, producing the results shown in Figure 3.8. Each node in Figure 3.8 represents a topic, and the correlations between topics are denoted by the width of the edges; edges with widths below a certain threshold (e.g., 0.15) are ignored. As in Table 3.4, yellow nodes are common topics, red nodes are tweet-dominant topics, and green nodes are news-dominant topics. The directions of the arrows imply the directions of influence. Node features such as degree, in-degree ratio, and out-degree ratio are listed in the last three columns of Table 3.4. Compared with the news-dominant and Twitter-dominant topics, common topics are more likely to have greater numbers of connections, such as topic 0 (No. 1 in degree) and topic 1 (No. 2 in degree). News-dominant topics have a strong influence on other topics, with 62% of their edges being outgoing arrows.

Figure 3.8: Topic Influence

In contrast, tweet-dominant topics are weak in influence: none of them has an outgoing edge. These observations mirror the real-world situation: news agencies can easily lead public opinion, while the voice of individuals is almost negligible.

Table 3.4: Topic Influence. “Twitter %” is the ratio of topic in Twitter data, while “News%” is the ratio of topic in news data. “Degree” denotes the node degree for each topic,“In%” is the ratio of in-coming edges, and “Out%” is proportion of out-going edges.

Topic | Twitter % | News % | Degree | Out% | In%
0 | 0.56 | 0.44 | 16 | 0.6 | 0.4
1 | 0.44 | 0.56 | 10 | 0.4 | 0.6
2 | 0.34 | 0.66 | 2 | 1 | 0
3 | 0.26 | 0.84 | 0 | 0 | 1
4 | 0.48 | 0.62 | 0 | 0 | 1
5 | 0.92 | 0.08 | 6 | 0 | 1
6 | 0.48 | 0.52 | 3 | 0 | 1
7 | 0.91 | 0.09 | 3 | 0 | 1
8 | 0.47 | 0.53 | 5 | 0.2 | 0.8
9 | 0.35 | 0.65 | 7 | 1 | 0
10 | 0.36 | 0.64 | 2 | 0 | 1
11 | 0.89 | 0.11 | 1 | 0 | 1
12 | 0.67 | 0.30 | 0 | 0 | 1
13 | 0.32 | 0.68 | 4 | 1 | 0
14 | 0.45 | 0.55 | 3 | 1 | 0
15 | 0.49 | 0.51 | 0 | 0 | 1
16 | 0.26 | 0.74 | 2 | 1 | 0
17 | 0.27 | 0.73 | 1 | 0 | 1
18 | 0.32 | 0.68 | 2 | 1 | 0
19 | 0.92 | 0.08 | 1 | 0 | 1

Temporal Patterns. Many researchers believe that news spreads earlier on Twitter than in traditional media [37]. Is this true? To answer this question, we quantitatively compare the temporal difference between the Twitter and news topics. Time series are first calculated through Equations (3.13) and (3.14), after which peaks are detected using pypeaks⁴. Results for the topic temporal features are listed in Table 3.5. Looking at the last row of Table 3.5, the Twitter data come slightly earlier than the news in terms of bursts, with an average lead time of 0.36 hours. Red rows denote the topics that appeared earlier in tweets, with large positive peak ratios. Green rows are topics that showed up first in the news, with high negative ratios. Yellow rows indicate topics with approximately simultaneous peaks. Interesting patterns can be obtained by correlating Table 3.4 with Table 3.5. Generally, 5 out of 7

⁴ https://github.com/gopalkoduri/pypeaks

Table 3.5: Comparison of topic temporal patterns. “Pos%” denotes the ratio of peaks occurring ear- lier in Twitter than in news, “Neg%” implies that peaks appeared earlier in the news, and “Sim%” indicates the ratio of peaks that burst simultaneously in the two datasets. “Avg.Lag” indicates the average time lags between news and Twitter peaks, where positive values imply Twitter data come first while negative numbers denote the leading time of news data.

Topic | Pos% | Neg% | Sim% | Avg. Lag
0 | 0.20 | 0.30 | 0.50 | -0.60
1 | 0.36 | 0.27 | 0.36 | -1.09
2 | 0.21 | 0.50 | 0.29 | -1.14
3 | 0.25 | 0.33 | 0.42 | -1.33
4 | 0.40 | 0.20 | 0.40 | -0.20
5 | 0.00 | 0.00 | 0.00 | 0.53
6 | 0.44 | 0.22 | 0.33 | 0.30
7 | 0.44 | 0.56 | 0.00 | -0.25
8 | 0.47 | 0.18 | 0.35 | 2.12
9 | 0.33 | 0.25 | 0.42 | 1.00
10 | 0.43 | 0.14 | 0.43 | 0.22
11 | 0.36 | 0.29 | 0.36 | -0.57
12 | 0.22 | 0.44 | 0.33 | -2.67
13 | 0.44 | 0.44 | 0.11 | 0.22
14 | 0.40 | 0.30 | 0.30 | 0.40
15 | 0.58 | 0.33 | 0.08 | 2.00
16 | 0.54 | 0.15 | 0.31 | 3.69
17 | 0.18 | 0.36 | 0.45 | -1.82
18 | 0.00 | 0.60 | 0.40 | -2.40
19 | 0.41 | 0.18 | 0.41 | 2.82
Total | 0.35 | 0.31 | 0.34 | 0.36

common topics in Table 3.4 are also simultaneous topics in Table 3.5, 4 out of 5 topics that appear first in the news in Table 3.5 are news-dominant topics in Table 3.4, and 4 out of 5 topics that show up first in tweets in Table 3.5 are either tweet-dominant topics or common topics in Table 3.4. The outliers are topic 12 and topic 16. Topic 12 is a Twitter-dominant topic in Table 3.4, which would thus be expected to appear first in tweets, but in fact occurs earlier in the news data. Topic 16 is a news-dominant topic in Table 3.4 that shows up first in tweets. Top-ranked words in topic 12 include "educativo" (educate), "elba" (name of the leader of SNTE), and "arresto" (arrest). By manually checking the corresponding news and tweets, we found that at the end of February 2013, the leader of SNTE, Elba Esther Gordillo, was arrested by the Mexican government on corruption allegations. This event was just a regular news report for news agencies, but it unexpectedly attracted great attention from social media users and actually became the main trigger of many of the subsequent protests. Top-ranked words in topic 16 include "marcha" (march), "oaxaca", and temporal terms such as "12:30pm". Clearly, items in topic 16 can be regarded as organized events that developed in virtual social media first and then caught the attention of traditional media once they began to occur in the real world.

Key news reports and tweets

Table 3.6: Top 5 key news documents in “teacher protests” theme. Texts are translated from Spanish to English by Google translator.

News ID | Importance count | News report title
985 | 478 | CNTE prepare to build the united organization
4243 | 414 | Politics at play in Mexico's ongoing teacher protests
1684 | 409 | Teachers' movement: faces and reasons for fighting
5453 | 351 | SNTE creative protest against the constitutional reform
8468 | 347 | Protesters in 14 states join the protest CNTE

Table 3.7: Top 5 key tweets in “teacher protests” theme. Texts are translated from Spanish to English by Google translator.

ID | Importance | Content | Author
413114 | 91 | Bullying also occurs from student to teacher: SNTE leader | proceso
332824 | 41 | teachers retired in protest because they pay them their retirement insurance, accuses indifference SECC 28 of SNTE | SoledadDurazo
38974 | 32 | #EnVivo The eviction of members of #CNTE http://bit.ly/1aI8AeQ #Eventos #news #Nacional #DF #Maestros #Protesta | AgendaFFR
136883 | 17 | CNTE members marched on Reforma and Bucareli to Segob where assembled tents to install a sit | REFORMACOM
39368 | 15 | The # socket and was evicted by police, congratulations can now celebrate their "independence" and "freedom". #CNTE | josemiguelgon

Table 3.6 and Table 3.7 present the top-ranked key news articles and tweets, respectively, according to the importance calculated using Equation (3.16). News documents are cited by words far more frequently than tweets: as can be seen from Tables 3.6 and 3.7, key news documents have hundreds of references, while even the most popular tweet messages are cited far less often. This is quite reasonable, since news documents are much longer and have more words than tweet posts. It is also clear that the key news articles listed are representative, largely because they are either the most up-to-date movement reports (e.g., News 985) or provide comprehensive event analysis (e.g., News 1684).

Interesting results can be found in the key tweets listed in Table 3.7. Most of these top ranked tweets are posted by key players, such as celebrities or authoritative media. For example, tweet 332824 is posted by a user named “Soledad Durazo”, a famous journalist in Mexico. Other key tweets contain numerous keywords, such as tweet 38974, which basically consists of a set of popular hashtags.

3.3 Conclusion

In this paper, we have proposed a hierarchical Bayesian model, NTIT, to analyze the interaction between news and social media. Our model enables joint topic modeling on multiple data sources in an asymmetrical frame, which benefits the modeling performance for both long and short texts. We present the results of applying the NTIT model to two large-scale datasets and show its effectiveness over non-trivial baselines. Based on the outputs of the NTIT model, further efforts are made to understand the complex interaction between news and social media data. Through extensive experiments, we find the following: 1) even for the same events, the focuses of news and Twitter topics can differ greatly; 2) a topic usually occurs first in its dominant data source, but occasionally a topic that first appears in one data source turns out to be dominant in the other; 3) generally, news topics are much more influential than Twitter topics.

Chapter 4

A Probabilistic Model for Discovering Common and Distinctive Topics from Multiple Datasets

4.1 Introduction

Most domain experts suggest that "comparative thinking" strategies are the most effective way to improve learning [103]. The key to comparative thinking is to distinguish the common and distinctive aspects between two objects. In the field of data mining, topic models have been widely used to identify the hidden topics underlying content [10]. However, most previous studies have focused on modeling datasets in isolation and are unable to simultaneously discover the common and distinctive topics among multiple datasets. Novel technologies capable of comparative thinking are therefore highly desirable for many applications. In order to simultaneously identify common and distinctive content, the following capabilities are necessary. 1) Clearly revealing common and distinctive topics. Traditional topic modeling methods such as LDA [10] or NMF [58] are unable to achieve good performance in discriminative learning. Although running standard topic modeling methods separately on different datasets is one possible solution, it generates non-comparable topics with different distributions and therefore requires additional processing, such as topic pair mapping, to further determine the common and distinctive topics; the performance is unlikely to be adequate due to the lack of clearly defined structures specially designed for common and distinctive topics. 2) Focusing on distinctive learning for content understanding. Few previous studies have provided solutions specifically for comparative thinking; most related work on distinctive learning studied class content features and focused on label prediction [54, 97, 91, 93, 94]. There remains a serious lack of models specifically designed for identifying the common and distinctive content across datasets. 3) Learning at the entire collection level. Although there has been considerable previous work

on mining the global and local aspects of documents [115, 82, 25, 42, 18], most studies have been restricted to working within one document collection. The problem described here requires learning across different datasets, a far more difficult task. 4) Supporting multiple datasets. Of existing methods, discNMF [48] is the closest to our work: Kim et al. proposed an NMF-based approach for distinctive learning based on a two-dataset scenario, which would be difficult to extend to multiple data collections. As real-world tasks generally require the analysis of more than two datasets simultaneously, a more general model that can be applied to an arbitrary number of datasets is clearly needed.

(a) Distinctive topics for Clinton. (b) Common topics. (c) Distinctive topics for Trump.

Figure 4.1: Topic summaries for news articles published in October 2016 related to the US presidential election.


Figure 4.2: Topic summaries for NIPS papers from 1987 to 2013.

In this paper, we propose a novel approach for Common and Distinctive Topic Modeling (CDTM) on multiple datasets. Our goal of discriminative learning is implemented through a novel probabilistic model that not only discovers topics characterizing a specific corpus, but also maximally exploits the shared information across multiple corpora. This type of discriminative learning provides the basis for a number of important applications, such as comparing similar content with the same time stamp, or analyzing content evolution across different time periods. In this context, similar content refers to datasets belonging to the same domain (common topics) but with different emphases or features (distinctive topics). For example, authors with different educational or cultural backgrounds are likely to have slightly different opinions regarding the same theme. The main goal of evolution study is to understand the overlaps and changes between old and new documents; for instance, which topics are fading or newly emerging (distinctive topics), and which topics are consistently present (common topics).

Figure 4.1 shows an example of the results obtained by applying the proposed CDTM model to discover the common and distinctive topics in different news datasets published in the same time range, specifically news articles containing the headline word "Clinton" or "Trump" published in October 2016. Figure (4.1a) shows Clinton's distinctive topics, suggesting that the most significant words are related to investigations, such as "email", "FBI", and "security". On the other hand, most words from Trump's distinctive topics concern issues such as "immigration", "border", and "abortion". However, as can be seen from Figure (4.1b), despite facing different difficulties, the two presidential candidates share common interests such as "election", "president", and "voters". This example illustrates how a model capable of discriminative learning can identify the special aspects of each dataset while at the same time capturing the key similarities between the datasets. Figure 4.2 is an illustrative example of evolution study: it shows how research fields evolved, based on an analysis of NIPS papers published from 1987 to 2013. "Neural networks" was the most popular term in the 1990s, with "SVM" and "boosting" methods becoming hot topics after that; "Bayesian" models such as LDA rose in later years, while "deep learning" research has gained more and more attention recently. These terms reflect the distinctive features of different time stamps. Meanwhile, terms such as "NLP", "converge", and "optimization" are consistently popular over time and belong to the common topics of the datasets across all time stamps. The proposed CDTM model is a specially designed Bayesian graphical model that learns common and distinctive topics simultaneously based on a hierarchical structure.
In CDTM, several global topic mixtures and word distributions are shared by all the different document sets, while more local topics are independently owned by each subset. The global structures (common topics) and local distributions (distinctive topics) are learned within the same unified framework through word-level topic assignments. The main contributions of this paper can be summarized as follows.

• A novel Bayesian model is proposed to simultaneously identify common and distinct topics among different datasets. The proposed CDTM model is the first graphical model to focus on identifying common and self-owned topics among multiple datasets, and can be used to develop a wide range of applications.

• An efficient Gibbs sampling inference is provided for the CDTM model. Gibbs sampling is utilized to estimate the parameters of the CDTM model due to its high accuracy when performing estimations for LDA-like graphical models.

• The effectiveness of the proposed CDTM model is demonstrated through extensive experiments. The performance of the proposed CDTM model is compared to those of the most relevant state-of-the-art algorithms on real-world datasets. Based on the extensive quantitative and qualitative results obtained, the new CDTM model shows significant improvement over the baseline methods.

The rest of the paper is organized as follows. Section 5.2 reviews related work. Section 5.3 introduces the CDTM model. Section 5.4 discusses the inference process of the model. Section 5.5 presents the experimental performance, and the paper concludes in Section 4.6.

4.2 Related Work

This section reviews the published research work on this topic. Although few previous studies have focused on exactly the same problem, there are three main branches of research related to this work: traditional topic modeling techniques [34, 21, 58, 105], discriminative topic modeling [54, 97, 91, 93, 94], and methods mining global and local aspects of documents [115, 82, 25, 42, 18].

4.2.1 Traditional Topic Models

Traditional topic models have been widely used to identify the latent topics in documents. In general, topic models can be classified into two categories, depending on whether the approaches are based on matrix decomposition (such as SVD) or are generative models. Probabilistic latent semantic analysis (PLSA) [34, 21] was the earliest such attempt, representing documents as mixtures of topics and learning latent topics by performing matrix decomposition on the term-document matrix. Similarly, non-negative matrix factorization (NMF) also learns topics through matrix decomposition, applying the constraint that the decomposed matrices contain only non-negative values [58]. The generative probabilistic model LDA took a different approach, assuming a Dirichlet prior for the latent topics [10]. Theoretically, LDA-based topic modeling techniques are able to learn more coherent topics than matrix decomposition approaches, as they allow topic mixtures to vary across documents [105, 74]. Many of these approaches can be used to implement the task addressed in this paper under special settings, such as LDA [10, 32] and its nonparametric variation HDP [107, 80]. However, the extensive experiments conducted for this study and presented in Section 5.5 demonstrate that these approaches are unable to match the performance of CDTM, since they are not specifically designed for discriminative learning.
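The contrast between the two families can be illustrated with a small sketch. The toy corpus, topic number, and scikit-learn estimators below are illustrative assumptions, not the datasets or settings used in this chapter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Toy corpus (an illustrative assumption, not one of the thesis datasets)
docs = [
    "database query sql index",
    "sql query optimization database",
    "neural network learning model",
    "learning model training network",
]
X = CountVectorizer().fit_transform(docs)  # term-document count matrix

# Generative family: LDA places a Dirichlet prior over document-topic mixtures.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Decomposition family: NMF factorizes the count matrix into non-negative
# document-topic and topic-term factors.
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(X)

print(lda.components_.shape, nmf.components_.shape)  # both: (n_topics, vocab_size)
```

Both estimators expose a topic-term matrix (`components_`), but only LDA treats the per-document topic mixture as a latent random variable with a Dirichlet prior.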

4.2.2 Discriminative Topic Modeling

Several discriminative topic modeling methods have been proposed to solve the classification problem. Rosen-Zvi et al. proposed the Author-Topic model, which aims to find different topic distributions over multiple authors, where each author has a corresponding topic mixture [97]. Based on the Author-Topic model, Lacoste-Julien et al. designed discLDA to study latent topics in order to predict class labels [54]. The main goal of the Author-Topic model is to model the interests of authors, while discLDA models class properties based on content; discLDA is also a special case of the Author-Topic model in which each individual document has only one author. To bring more supervised characteristics to traditional LDA, Ramage et al. proposed a variation named labeledLDA to study the mapping between latent topics and pre-given labels [93]. They then went on to explore the latent relationship between topics and labels, at the cost of higher complexity [94]. In summary, these previous works all utilize class labels to study latent topics, enabling the learned topical representations to be used for label prediction. Unlike these objectives, the method we propose here aims to discover both the common and the differing aspects of document sets.

4.2.3 Global and Local Aspects Mining

Another branch of related studies learns the structures within a single document/collection. Chemudugunta et al. identified background topics and document-specific topics via a variation of LDA [18]. Similarly, Huang et al. recognized local and global aspects of documents and organized these components into a storyline via optimization [42]. Wang et al. studied the same problem of local/global topic discovery through iterative decomposition towards events [115]. Paul et al. added an aspect variable to the LDA model so that a word may depend on a topic, an aspect, both, or neither [82]. Ge et al. proposed a method to summarize documents into chronicles according to the mapping of their underlying topics [25]. Once again, however, each of these models studies the patterns within one document collection, whereas ours seeks to learn the relationships among different data sets. In addition, these existing approaches are particular applications designed for specific problems such as chronicle/storyline generation, rather than general solutions for document analysis.

4.3 Proposed Method

In this section, we introduce CDTM, a probabilistic model that aims to identify the common topics shared by multiple data sets, and the distinctive topics representing the features of each data set.

4.3.1 Problem Statement

Suppose there are S = {s1, s2, ..., s|S|} data sets, where each data set s contains Ds documents and each document m includes Nm words. The vocabulary of the whole document corpus contains V = {V1 ∪ V2 ∪ ... ∪ V|S|} terms. A topic is defined as a V-dimensional vector indicating a distribution over words. The goal of this paper is to find Kd distinctive topics for each data set, and Kc topics shared by all data sets in S. The notations and variables used in this paper are listed in Table 5.1.

4.3.2 Model Definition

Our proposed CDTM model learns common and distinctive topics through a specially designed Bayesian graphical model. The graphical model and generative process are shown in Figure 5.2 and Algorithm 2, respectively. We assume that a document consists of words from two types of

Table 4.1: Variable Notations

Notation   Description
Ds         number of documents in a dataset s
Nd         number of words in a document d
V          vocabulary size
S          collection of datasets
Kc         number of common topics
Kd         number of distinctive topics
β0         hyperparameter for the mixing proportion for common topics
β1         hyperparameter for the mixing proportion for distinctive topics
α0         hyperparameter for the mixing proportion for θc
α1         hyperparameter for the mixing proportion for θd
µ          hyperparameter for the mixing proportion for indicator s
λ          mixing proportion for the mixture indicator x
Φc         mixture component of common topics
Φd         mixture component of distinctive topics
θc         common topic mixture proportion
θd         distinctive topic mixture proportion
x          mixture indicator of common or distinctive topic choice
z          mixture indicator of topic choice
s          dataset indicator
w          term indicator for a specific word in one document

Figure 4.3: Framework of CDTM model.

ALGORITHM 2: Generation Process of CDTM model
Draw φc ∼ Dir(β0) for Kc times;
Draw φd ∼ Dir(β1) for |S| ∗ Kd times;
for each doc m ∈ [1, M] do
    Draw s ∼ Mult(µ);
    Draw θcm ∼ Dir(α0);
    Draw θdm ∼ Dir(α1);
    Draw λm ∼ Dir(γ);
    for each word w in document m do
        Draw x ∼ Mult(λm);
        if x = 0, draw z ∼ Mult(θcm) and choose w ∼ Mult(φc^z);
        if x = 1, draw z ∼ Mult(θdm) and choose w ∼ Mult(φd^sz);

We assume that a document consists of words from two types of topics: common topics, which are high-level global topics shared across multiple datasets, and specific topics, which are detailed topics belonging to one dataset. As a result, the common topics are fixed for all datasets, while the distribution of specific topics varies with respect to different datasets. Specifically, each word in a document is associated with a distribution over topics. It can either be sampled from the distinctive topic mixture θd over Kd specific topics, or from the common topic mixture θc over Kc general topics, depending on a binary variable x sampled from a binomial distribution with parameter λ. Meanwhile, λ is controlled by a prior γ expressing the document’s preference between common and specific topics. The generative process for words in the documents involves three stages.

1. Choose variable x. The per-word variable x is drawn from a per-document multinomial distribution with parameter λ. x = 0 indicates that the corresponding word is more likely to be generated from the common topics, while x = 1 implies that the corresponding word comes from a distinctive topic.

2. Choose topic z. After choosing x, the topic z for each word w is drawn from the Kc common topics if x = 0, and from the Kd distinctive topics if x = 1.

3. Choose term w. Depending on the choices made for x and z, the word w is generated either from the common topic-term distribution Φc^kc when x = 0 and z = kc, or from the distinctive topic-term distribution Φd^kd when x = 1 and z = kd.
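The three stages above can be sketched in a few lines of NumPy. The corpus sizes and hyperparameter values below are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not the paper's settings)
S, Kc, Kd, V, M, N = 2, 2, 3, 50, 4, 20
alpha0 = alpha1 = 0.1
beta0 = beta1 = 0.1
gamma = 0.5

# Corpus-level word distributions: Kc common topics shared by all datasets,
# and Kd distinctive topics per dataset (|S| * Kd in total).
phi_c = rng.dirichlet(np.full(V, beta0), size=Kc)       # Kc x V
phi_d = rng.dirichlet(np.full(V, beta1), size=(S, Kd))  # S x Kd x V

docs = []
for m in range(M):
    s = int(rng.integers(S))                    # dataset indicator
    theta_c = rng.dirichlet(np.full(Kc, alpha0))
    theta_d = rng.dirichlet(np.full(Kd, alpha1))
    lam = rng.dirichlet(np.full(2, gamma))      # preference: common vs distinctive
    words = []
    for _ in range(N):
        x = rng.choice(2, p=lam)                # stage 1: choose indicator x
        if x == 0:
            z = rng.choice(Kc, p=theta_c)       # stage 2: a common topic
            w = rng.choice(V, p=phi_c[z])       # stage 3: term from common topic
        else:
            z = rng.choice(Kd, p=theta_d)       # stage 2: a distinctive topic
            w = rng.choice(V, p=phi_d[s, z])    # stage 3: term from dataset-s topic
        words.append(int(w))
    docs.append((s, words))

print(len(docs), len(docs[0][1]))
```

Each simulated document mixes words drawn from the shared common topics with words drawn from the distinctive topics of its own dataset s, mirroring the switch variable x in the model.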

Like other LDA-based variations, CDTM models each document as a mixture of topics, with words generated from specific topics. Unlike other probabilistic models, CDTM can model multiple data sets within a single unified framework and simultaneously identify the common and distinguishing aspects of each data set through multi-layer topics. To achieve this, the CDTM model utilizes two collection-level word distributions (the distinctive topics Φd and common topics Φc) to characterize the corpus.

1. Distinctive topics Φd^s. There are |S| ∗ Kd collection-dependent parameters Φd^s in total, such that each collection s has its own Kd topics.

2. Common topics Φc. Unlike Φd^s, there are only Kc common topic parameters Φc, shared by all |S| datasets.

Therefore, the whole corpus consisting of |S| data sets can be modeled through |S| ∗ Kd + Kc topics. In addition to the common and distinctive topics, we also design document-level mixtures to capture the variance of each document and data set. Specifically, in the CDTM model, two document-level topic mixtures are governed by a per-document variable that switches the choice between common and distinctive topics.

1. Preference mixture λ. This variable is a per-document beta distribution that controls the value of the per-word binary preference variable x.

2. Common topic mixture θc. Each document has a common topic mixture θc, which only comes into play when a word’s preference variable x equals 0.

3. Distinctive topic mixture θd. Similarly, the document-level distinctive topic mixture θd only applies when a word’s preference variable x equals 1.

4.4 Inference

Although the exact inference of the posterior distributions of the hidden variables is generally intractable, the solution can be estimated through approximate inference algorithms, such as mean-field variational inference [10, 33, 32], Gibbs sampling [27, 88, 15], maximum likelihood estimation [20, 11], and numerical optimization [90, 120]. Gibbs sampling is used for the inference of the proposed CDTM model, as this approach yields more accurate estimations than variational inference for LDA-like graphical models.

4.4.1 Joint distribution

When the data set label s of a document is observed, its labeling prior µ is d-separated from the rest of the model. When s is unobserved, the CDTM model can be used to predict the collection label for documents. Based on Algorithm 2 and the graphical model in Figure 5.2, the joint distribution of the CDTM model can be represented as Equation (5.5):

$$
\begin{aligned}
p(\mathbf{w},\mathbf{z},\mathbf{x}\mid\alpha_0,\alpha_1,\gamma,\beta_0,\beta_1)
={}& p(\alpha_0)\,p(\alpha_1)\,p(\beta_0)\,p(\beta_1)
\prod_{m=1}^{M}\prod_{n=1}^{N} p(w_{mn}\mid\Phi_c,\Phi_d,z_{mn},x_{mn}) \\
&\times \prod_{m=1}^{M}\prod_{n=1}^{N}\prod_{x=1}^{X} p(z_{mn}\mid\theta_c^{(m)},\theta_d^{(m)},x_{mn}=x)
\prod_{m=1}^{M}\prod_{n=1}^{N} p(x_{mn}\mid\lambda_m) \\
&\times \prod_{m=1}^{M} p(\lambda_m\mid\gamma)\,p(\gamma)
\prod_{m=1}^{M} p(\theta_d^{(m)}\mid\alpha_1)
\prod_{m=1}^{M} p(\theta_c^{(m)}\mid\alpha_0).
\end{aligned} \tag{4.1}
$$

4.4.2 Hidden Variables

The key to this inference problem is to estimate the posterior distributions of the following hidden variables: (1) the topic assignment indicator zmn for words; (2) the common/distinctive preference indicator xmn for words; and (3) the topic mixture proportions θc and θd and the preference mixture proportion λ for documents. As a special case of Markov chain Monte Carlo, Gibbs sampling iteratively samples one instance at a time, conditioned on the current values of the remaining variables. We only present the results here; the detailed derivation process is omitted due to space limitations.

According to Bayes’ rule, the conditional probability of zmn can be computed by dividing the joint distribution in Equation (5.5) by the joint distribution of all the variables except zmn. Since zmn depends on the value of xmn, the sampling of zmn is discussed separately for the two situations x = 0 and x = 1. When xmn = 0, which indicates that the topic zmn is chosen from the common topics, the conditional probability of zmn is as follows:

$$p(z_{mn}=k\mid\mathbf{w},\mathbf{z}_{\neg mn},x_{mn}=0) \propto \frac{n_{cz}^{v}+\beta_0}{\sum_{v=1}^{V}\left(n_{cz}^{v}+\beta_0\right)} \cdot \frac{n_{cm}^{z}+\alpha_0}{\sum_{z=1}^{K_c}\left(n_{cm}^{z}+\alpha_0\right)}, \tag{4.2}$$

where $n_{cz}^{v}$ is the number of times term v is assigned to common topic z in the whole corpus, and $n_{cm}^{z}$ is the number of words in document m assigned to common topic z. Similarly, when xmn = 1, the conditional probability of zmn is as shown in Equation (4.3), where $n_{dz}^{v}$ is the number of times term v is assigned to distinctive topic z in the current data set, and $n_{dm}^{z}$ is the number of words in the current document m assigned to distinctive topic z:

$$p(z_{mn}=k\mid\mathbf{w},\mathbf{z}_{\neg mn},x_{mn}=1) \propto \frac{n_{dz}^{v}+\beta_1}{\sum_{v=1}^{V}\left(n_{dz}^{v}+\beta_1\right)} \cdot \frac{n_{dm}^{z}+\alpha_1}{\sum_{z=1}^{K_d}\left(n_{dm}^{z}+\alpha_1\right)}. \tag{4.3}$$

Similar to the inference of z, the derivation of the posterior of x is discussed for two cases: xmn = 0 or xmn = 1. Specifically, when xmn = 0, the inference is calculated as in Equation (4.4), where $n_{m}^{0}$ is the number of words choosing x = 0 in document m:

$$p(x_{mn}=0\mid\mathbf{w},\mathbf{z},\mathbf{x}_{\neg mn}) \propto \frac{n_{cz}^{v}+\beta_0}{\sum_{v=1}^{V}\left(n_{cz}^{v}+\beta_0\right)} \cdot \frac{n_{cm}^{z}+\alpha_0}{\sum_{z=1}^{K_c}\left(n_{cm}^{z}+\alpha_0\right)} \cdot \frac{n_{m}^{0}+\gamma}{\sum_{x=1}^{X}\left(n_{m}^{x}+\gamma\right)}. \tag{4.4}$$

In the case of xmn = 1, the inference is computed as in Equation (4.5), where $n_{m}^{1}$ is the number of words choosing x = 1 in document m:

$$p(x_{mn}=1\mid\mathbf{w},\mathbf{z},\mathbf{x}_{\neg mn}) \propto \frac{n_{dz}^{v}+\beta_1}{\sum_{v=1}^{V}\left(n_{dz}^{v}+\beta_1\right)} \cdot \frac{n_{dm}^{z}+\alpha_1}{\sum_{z=1}^{K_d}\left(n_{dm}^{z}+\alpha_1\right)} \cdot \frac{n_{m}^{1}+\gamma}{\sum_{x=1}^{X}\left(n_{m}^{x}+\gamma\right)}. \tag{4.5}$$
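A minimal sketch of how Equations (4.2) to (4.5) turn count statistics into sampling probabilities for a single word is given below. The counts are toy values, and marginalizing over z before sampling x is one common implementation choice, not necessarily the exact procedure of Algorithm 3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy counts for one document m and one dataset s (shapes are assumptions)
V, Kc, Kd = 6, 2, 3
alpha0 = alpha1 = 0.1
beta0 = beta1 = 0.01
gamma = 0.5

n_cz = rng.integers(0, 5, (Kc, V)).astype(float)  # common topic-term counts
n_dz = rng.integers(0, 5, (Kd, V)).astype(float)  # distinctive topic-term counts (dataset s)
n_cm = rng.integers(0, 5, Kc).astype(float)       # common doc-topic counts
n_dm = rng.integers(0, 5, Kd).astype(float)       # distinctive doc-topic counts
n_mx = np.array([n_cm.sum(), n_dm.sum()])         # words with x=0 / x=1 in doc m

v = 2  # the term of the word being resampled

# Equations (4.2)/(4.3): unnormalized conditionals over topics, per indicator value
p_z_common = ((n_cz[:, v] + beta0) / (n_cz.sum(1) + V * beta0)
              * (n_cm + alpha0) / (n_cm.sum() + Kc * alpha0))
p_z_distinct = ((n_dz[:, v] + beta1) / (n_dz.sum(1) + V * beta1)
                * (n_dm + alpha1) / (n_dm.sum() + Kd * alpha1))

# Equations (4.4)/(4.5): fold in the preference term (n_m^x + gamma) / sum_x(...)
pref = (n_mx + gamma) / (n_mx.sum() + 2 * gamma)
p_x = np.array([pref[0] * p_z_common.sum(), pref[1] * p_z_distinct.sum()])
p_x /= p_x.sum()

# Sample the indicator x, then the topic z from the matching conditional
x = rng.choice(2, p=p_x)
if x == 0:
    z = rng.choice(Kc, p=p_z_common / p_z_common.sum())
else:
    z = rng.choice(Kd, p=p_z_distinct / p_z_distinct.sum())
print(x, z)
```

The Dirichlet smoothing terms (β, α, γ) keep every probability strictly positive, so the sampler can always move a word to a currently empty topic.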

4.4.3 Multinomial Parameters

Variables Φc, Φd, θc, θd, and λ are multinomial distributions with Dirichlet priors. According to Bayes’ rule and the definition of the Dirichlet priors, these multinomial parameters can be computed from the above posteriors. For example, the common topic word distribution $\Phi_{cz}^{v}$ for term v and the common topic mixture $\theta_{cz}^{(m)}$ for document m are computed as given in Equation (4.6) and Equation (4.7):

$$\Phi_{cz}^{v} = \frac{n_{cz}^{v}+\beta_0}{\sum_{v=1}^{V}\left(n_{cz}^{v}+\beta_0\right)}, \tag{4.6}$$

$$\theta_{cz}^{(m)} = \frac{n_{cm}^{z}+\alpha_0}{\sum_{z=1}^{K_c}\left(n_{cm}^{z}+\alpha_0\right)}. \tag{4.7}$$

Variables Φdz and θd can be calculated in a similar way:

$$\Phi_{dz}^{v} = \frac{n_{dz}^{v}+\beta_1}{\sum_{v=1}^{V}\left(n_{dz}^{v}+\beta_1\right)}, \tag{4.8}$$

$$\theta_{dz}^{(m)} = \frac{n_{dm}^{z}+\alpha_1}{\sum_{z=1}^{K_d}\left(n_{dm}^{z}+\alpha_1\right)}. \tag{4.9}$$

Also, the posterior of λ is as follows, where x can be 0 or 1:

$$\lambda_{m}^{x} = \frac{n_{m}^{x}+\gamma}{\sum_{x=1}^{X}\left(n_{m}^{x}+\gamma\right)}. \tag{4.10}$$

Since each document is a combination of common topics and distinctive topics, the average topic mixture for document m is therefore calculated on the basis of θcm, θdm, and λm:

$$\theta^{(m)} = \lambda_{m}^{0}\,\theta_{c}^{(m)} + \lambda_{m}^{1}\,\theta_{d}^{(m)}. \tag{4.11}$$

4.4.4 Gibbs sampling algorithm

The Gibbs sampling process for the CDTM model is shown in Algorithm 3. The procedure maintains five count variables: $n_{1sz}^{v}$ and $n_{0z}^{v}$ are topic-term matrices with dimension K × V, $n_{1sm}^{z}$ has M rows and Kd columns, $n_{m}^{x}$ has dimension M × 2, and $n_{0}^{z}$ is a Kc-dimensional vector. The Gibbs sampling algorithm has three stages: initialization, a burn-in period, and a sampling period. Determining the optimum duration of the burn-in period is essential for MCMC approaches; in this paper, we observe changes in the perplexity to check whether the Markov chain has converged. There are several strategies for using the results from Gibbs samplers: one is to read the results from a single iteration (e.g., the last iteration), another is to use the average of multiple samples. To obtain approximately independent Markov chain states, we use a “sampling lag” when reading results, which leaves an interval of I iterations between subsequently chosen samples.
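The burn-in and sampling-lag bookkeeping described above can be sketched as follows. Here `gibbs_sweep` is a hypothetical stand-in for one full pass of the sampler over all words, and the iteration counts are illustrative.

```python
# Iteration bookkeeping only; gibbs_sweep is a hypothetical stand-in for one
# full pass of Algorithm 3 (resampling x and z for every word).
BURN_IN = 100   # iterations discarded while the chain converges
LAG = 20        # "sampling lag": interval I between retained samples
TOTAL = 300     # total Gibbs iterations

def gibbs_sweep(state):
    # placeholder for one sweep over all words, updating the count statistics
    return state

state, samples = {}, []
for it in range(TOTAL):
    state = gibbs_sweep(state)
    if it >= BURN_IN and (it - BURN_IN) % LAG == 0:
        samples.append(dict(state))   # retain a (near-)independent sample

print(len(samples))  # 10 samples retained: iterations 100, 120, ..., 280
```

Averaging the multinomial parameters over the retained samples, rather than reading a single iteration, reduces the Monte Carlo variance of the final estimates.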

4.5 Experiments

In this section, our proposed CDTM model is validated using different real-world datasets. The datasets and comparison methods are described in Section 4.5.1, and the performance of the new

ALGORITHM 3: Gibbs sampling algorithm for CDTM model

Input: word vectors {w}; hyperparameters α0, α1, β0, β1; topic numbers Kc and Kd
Global data: count statistics {n_m^x}, {n_0m^z}, {n_0z^v}, {n_1sm^z}, {n_1sz^v}
Output: topics {z}; indicators {x}; multinomial parameters Φc, Φd, Θc, Θd; hyperparameter estimations α0, α1, β0, β1

// initialization
zero all count variables n_m^x, n_1sz^v, n_1sm^z, n_0z^v, n_0^z
for each dataset s ∈ [1, S] do
    for each doc m ∈ [1, M] do
        for each word n ∈ [1, Nm] in document m (the corresponding term of word n is v) do
            sample indicator xmn = x̃ ∼ Binomial(λm)
            increment document-indicator count: n_m^x̃ += 1
            if x̃ == 0 then
                sample topic index zmn = k ∼ Mult(1/Kc)
                increment common document-topic count: n_0m^k += 1
                increment common topic-term count: n_0k^v += 1
            if x̃ == 1 then
                sample topic index zmn = k ∼ Mult(1/Kd)
                increment distinctive document-topic count: n_1sm^k += 1
                increment distinctive topic-term count: n_1sk^v += 1

// Gibbs sampling: burn-in period and sampling period
for each dataset s ∈ [1, S] do
    for each doc m ∈ [1, M] do
        for each word n ∈ [1, Nm] in document m (the corresponding term of word n is v) do
            // for the current assignment of x and k to term v of word wmn
            decrement count: n_m^x -= 1
            if x == 0 then
                decrement counts: n_0m^k -= 1, n_0k^v -= 1
                sample new indicator x̃ via Equation (4.4)
            if x == 1 then
                decrement counts: n_1sm^k -= 1, n_1sk^v -= 1
                sample new indicator x̃ via Equation (4.5)
            if x̃ == 0 then
                sample new topic k̃ via Equation (4.2)
                increment counts: n_0m^k̃ += 1, n_0k̃^v += 1
            if x̃ == 1 then
                sample new topic k̃ via Equation (4.3)
                increment counts: n_1sm^k̃ += 1, n_1sk̃^v += 1

method is compared with those achieved by other existing approaches using various metrics in Section 4.5.3 and Section 4.5.4. Finally, we discuss the application of our proposed method to discover interesting topics using a case study in Section 4.5.5.

//Gibbs sampling burn-in period and sampling period; for each dataset s ∈ [1,S] do for each doc m ∈ [1,M] do for each word n ∈ [1,Nm] in document m (corresponding term of word n is v) do // for current assignment x and k to a term v for word wm,n ; x˜ decrement count: nm− = 1 if x==0 then k v decrement counts: n0m− = 1, n0k− = 1 sample new indicatorx ˜ via Equation (4.4) if x==1 then k v decrement counts: n1sm− = 1, n1sk− = 1 sample new indicatorx ˜ via Equation (4.5) if x==0˜ then sample new topic k˜ via Equation (4.2) increment counts: nk˜ + = 1, nv + = 1 0m 0k˜ if x==1˜ then sample new topic k˜ via Equation (4.3) increment counts: nk˜ + = 1, nv + = 1 1sm 1sk˜ 90 method is compared with those achieved by other existing approaches using various metrics in Section 4.5.3 and Section 4.5.4. Finally we discuss the application of our proposed method to discover interesting topics using a case study in Section 4.5.5.

4.5.1 Datasets and Experiment Settings

To evaluate our method and the other baseline comparisons, three real-world document datasets are used in our experiments: the 20 Newsgroups data (20 clusters, 18,828 documents, and 43,009 keywords), the Reuters dataset (65 clusters, 8,293 documents, 18,933 keywords), and the Four area dataset (4 groups, 15,110 documents, 6,487 keywords). These datasets were selected for their public availability and wide usage in topic modeling evaluations [54, 48]. Three sub-datasets are formed with different clusters, as shown in Table 4.2. To conduct an extensive comparison, various ratios of common and distinctive clusters are assigned to the datasets. In the Reuters data, the number of topics contained in the common cluster is the same as that in each of the exclusive clusters. In the 20 news data, the number of common topics is smaller than the number of distinctive topics, while in the Four area dataset, the number of common topics is larger than the number of distinctive topics.

For the CDTM model, weak symmetric priors are used for all Dirichlet or Beta parameters: α0 = α1 = 0.1, β0 = β1 = 0.001, γ = 0.5, µ = 0.1. The distinctive topic number Kd and common topic number Kc for each dataset are set as follows: 1) Kd = 3 and Kc = 3 in the Reuters dataset; 2) Kd = 3 and Kc = 1 in the 20 news dataset; 3) Kd = 1 and Kc = 2 in the Four area dataset. The Gibbs sampler is run for 400 iterations, with the first 100 iterations as the burn-in period. All the implementation code and datasets used in this paper will be made available once the paper is accepted.

Table 4.2: Datasets

Dataset   Common cluster                       Exclusive clusters in subset 1                             Exclusive clusters in subset 2
Reuters   sugar, coffee, trade                 gnp, gold, ship                                            cpi, crude, cocoa
20 news   alt.atheism, sci.space               comp.graphics, comp.sys.ibm.pc.hardware, comp.windows.x    talk.politics.guns, talk.politics.mideast, talk.politics.misc
4 area    Data Mining, Information Retrieval   Machine Learning                                           Database

4.5.2 Comparison methods and validation metrics

The following four methods serve as the baselines for this paper; they include the approaches deemed most relevant to this problem.

• LDA [10]: This is the standard topic modeling approach, widely used in the literature. We ran the LDA method on the different subsets separately. For the best results, we used weak symmetric priors in our experiments: α = 0.1 and β = 0.001.

• NMF[58]: This is the most popular topic modeling method based on matrix decomposition. As with the standard LDA, we applied the NMF method separately to each subset for topic discovery.

• discLDA [54]: This is a variation of LDA that is capable of discriminative modeling. There are three tunable hyperparameters α, β, and π in this approach, which are here set to 0.1, 0.001, and 0.1, respectively.

• discNMF [48]: This is a discriminative topic modeling method based on NMF. To achieve the best performance, we set the parameter α to 100, and β to 10.

The quality of the topic modeling results is evaluated in terms of different measures: perplexity, accuracy, and NMI.

• Perplexity. Perplexity is a standard metric used to evaluate topic modeling approaches [10, 30], and is typically defined as follows:

$$\mathrm{Perplexity}(D) = \exp\!\left(-\frac{\sum_{m=1}^{M}\sum_{n=1}^{N_m}\log P(w_{mn})}{\sum_{m=1}^{M} N_m}\right) \tag{4.12}$$

where M is the number of documents, wmn is the n-th word of document m, and Nm is the number of words in document m. A lower perplexity indicates more accurate performance of the model. Here the probability of the word wmn occurring in document m, given its parameters, can be calculated as follows:

$$P(w_{mn}) = \begin{cases} \phi_{cz_{mn}}^{w_{mn}}\,\theta_{cz_{mn}}^{(m)}, & \text{if } x_{mn}=0, \\[4pt] \phi_{dz_{mn}}^{w_{mn}}\,\theta_{dz_{mn}}^{(m)}, & \text{if } x_{mn}=1, \end{cases} \tag{4.13}$$

where $\phi_{cz_{mn}}^{w_{mn}}$ and $\phi_{dz_{mn}}^{w_{mn}}$ can be computed through Equations (4.6) and (4.8), while $\theta_{cz_{mn}}^{(m)}$ and $\theta_{dz_{mn}}^{(m)}$ can be calculated through Equations (4.7) and (4.9).

• Accuracy. Clustering accuracy (ACC) quantitatively measures the mapping relationship between result clusters and labeled classes [14]. A larger ACC value means better clustering performance. Given a document m, result label rm, and ground truth label sm, the clustering accuracy is computed as follows:

$$\mathrm{ACC} = \frac{\sum_{m=1}^{M}\delta(s_m, r_m)}{M}, \tag{4.14}$$

where M is the total number of documents and δ(x, y) is a delta function that equals one if x = y and zero otherwise. In our evaluation, the ACC metric is used to measure the quality of clusters, where M is the total number of documents within a cluster in the ground-truth case and δ(x, y) counts the documents correctly labeled by the methods.

• Normalized Mutual Information. Normalized Mutual Information (NMI) is used to measure the quality of clusters, and is typically defined as follows:

$$\mathrm{NMI} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{c} n_{i,j}\log\frac{n\cdot n_{i,j}}{n_i\,\hat{n}_j}}{\sqrt{\left(\sum_{i=1}^{c} n_i\log\frac{n_i}{n}\right)\left(\sum_{j=1}^{c}\hat{n}_j\log\frac{\hat{n}_j}{n}\right)}}. \tag{4.15}$$

In this paper, NMI is used to evaluate clustering performance, where c is the number of clusters, $n_i$ is the number of documents contained in ground truth class $C_i$, $\hat{n}_j$ is the number of documents belonging to result label $L_j$, and $n_{i,j}$ is the number of documents in the intersection between ground truth class $C_i$ and result label $L_j$. Typically, a larger NMI value indicates better clustering performance.
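The three measures above can be sketched on toy values. The labels and per-word probabilities below are illustrative assumptions, and NMI is computed with scikit-learn's implementation rather than Equation (4.15) directly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

# Perplexity, Equation (4.12): exp of the negative average log-likelihood.
# word_probs holds toy P(w_mn) values for each word of two held-out documents.
word_probs = [np.array([0.02, 0.01, 0.05]), np.array([0.03, 0.004])]
log_lik = sum(np.log(p).sum() for p in word_probs)
n_words = sum(len(p) for p in word_probs)
perplexity = float(np.exp(-log_lik / n_words))

# Clustering accuracy, Equation (4.14): result labels are first re-mapped to
# ground-truth classes with the Hungarian algorithm, then matches are counted.
truth = np.array([0, 0, 1, 1, 2, 2])
result = np.array([1, 1, 0, 0, 2, 2])     # the same partition, labels permuted
k = 3
cost = np.zeros((k, k))
for t, r in zip(truth, result):
    cost[r, t] -= 1                        # maximizing matches = minimizing -counts
rows, cols = linear_sum_assignment(cost)
mapping = dict(zip(rows, cols))            # result label -> ground-truth class
acc = float(np.mean([mapping[r] == t for t, r in zip(truth, result)]))

# NMI, Equation (4.15), via scikit-learn.
nmi = normalized_mutual_info_score(truth, result)

print(acc, round(float(nmi), 4))  # a relabeled-but-identical partition scores 1.0 on both
```

Because ACC depends on the label permutation while NMI does not, the Hungarian re-mapping step is what makes the two measures comparable on the same clustering.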

4.5.3 Quantitative Performance

Parameter Sensitivity Analysis

The distinctive topic number Kd is an important parameter for the discriminative learning methods discLDA, discNMF, and our proposed CDTM. Keeping the total topic number K = Kc + Kd fixed, Figure 4.4 shows the perplexity of discLDA, discNMF, and CDTM for different distinctive topic numbers Kd. The correct numbers of common and distinctive topics here are both three. Two conclusions can be drawn from this figure.

• Minimum perplexity. Both discNMF and our proposed CDTM show minimum perplexity when Kd is set to 3, which is the correct number of discriminative topic pairs. However, there is no obvious correspondence between the minimum perplexity and the correct number of discriminative topics for discLDA.

• Sensitivity. Our proposed CDTM consistently obtains low perplexity as the value of Kd changes. Interestingly, the other graphical model, discLDA, is the second best performer in terms of variance, while the matrix factorization method discNMF seems to be very sensitive to the setting of the parameter Kd, with perplexity increasing dramatically when Kd is set to 5.

In summary, Figure 4.4 shows that the CDTM model consistently provides results that are closest to the ground truth; discLDA is also stable, with comparatively small variance, although it is hard to assign the right parameter value; and discNMF is a good performer in most cases except for extreme parameter values.

Clustering Performance

The clustering validation evaluates the quality of the resulting clusters against the ground-truth cluster labels. First, the cluster index for each document is computed as the most strongly associated topic index. For our proposed CDTM model, this step identifies the maximal element θdz in the vector θd, which can be computed similarly to Equation (4.7). The results of discLDA are computed by jointly considering the transformation matrices and the topic mixture distribution. For the NMF and discNMF methods, this step is implemented by finding the corresponding column vector of the factor matrix H. The obtained results are then re-mapped to the ground truth labels using the Hungarian algorithm [50]. Two widely adopted cluster quality measures, ACC and NMI, are used to evaluate the performance; the results are listed in Table 4.3.

Table 4.3: The clustering performances achieved by NMF, LDA, discNMF, discLDA, and our pro- posed CDTM measured in terms of accuracy and NMI. Higher values indicate better performance.

Method     Reuters ACC   Reuters NMI   20 news ACC   20 news NMI   4 area ACC   4 area NMI
LDA        51.327216     0.143664      31.141321     0.071605      52.864178    0.217929
NMF        55.565432     0.220992      34.387952     0.125356      44.631722    0.147453
discLDA    55.148413     0.236635      33.813746     0.122223      46.789613    0.23470
discNMF    54.113611     0.228853      35.747743     0.116878      38.136921    0.21990
CDTM       56.815345     0.238987      40.813682     0.222812      55.179331    0.391723

Generally, our proposed CDTM model outperforms all other existing methods (NMF, LDA, disc- NMF and discLDA) in all datasets for both measures ACC and NMI.

• Comparisons among different datasets. In general, the ACC and NMI values of all methods increase as the actual topic number of each cluster (6 topics for the Reuters data, 4 topics for the 20 news dataset, and 3 topics for the Four area dataset) decreases. In the Reuters dataset, where the number of common and distinctive topics is the same, the four baseline methods yield very close results. In the 20 news dataset, as the common topic number decreases, all methods see an increase in performance, but our proposed CDTM model obtains the largest improvement. Also, in the Four area dataset, where the distinctive topic number is less than the common topic number, CDTM is still the best performer, with much better ACC and NMI than the other methods.

• Comparisons between CDTM and other LDA-based approaches. LDA, discLDA, and the proposed CDTM model are all LDA-based approaches. CDTM is the best performer in all

3 datasets. discLDA is less stable, beating LDA in the Reuters and 20 news datasets, but performing worse in the Four area dataset. The main difference between CDTM and discLDA is that CDTM is entirely inferred through Gibbs sampling, while the parameters of discLDA are estimated through a combination of Gibbs sampling and the EM algorithm. Such a combined process may result in performance instability.

• Comparisons among NMF-based approaches. Both the NMF and discNMF algorithms model topics through matrix factorization. The NMF method is slightly better than discNMF for the Reuters and Four area datasets, while discNMF performs better for the 20 news dataset. This indicates that discNMF can perform well when there is some imbalance between common and distinctive topics, but degenerates to standard NMF when the numbers of common and distinctive topics are similar; it may also suffer from instability due to its greater model complexity.

• Comparisons between LDA-based and NMF-based approaches. In the 20 news dataset, NMF performs better than LDA, and discNMF is better than discLDA. In the Four area dataset, LDA is much better than NMF, and discLDA is much better than discNMF. This indicates that LDA-based approaches can obtain good performance when there are more common topics, while NMF-based approaches can generate good results when there are more distinctive topics in the dataset.

4.5.4 Topic Distributions

To examine the detailed modeling performance, we look at the top-ranked words for each topic [116, 9]. To further analyze the differences between the NMF-based and LDA-based methods, Table 4.4 lists the top-10 ranked words in the topics learned by the discNMF model and our CDTM model. In this experiment, we utilized the Four area dataset, because differentiating between these four areas is a difficult task due to the high content similarity between any two sub-groups. First, the four areas (Data Mining, Information Retrieval, Machine Learning, and Database) are the research areas closest to data science, a sub-discipline of computer science. Second, in general, both “Information Retrieval” and “Data Mining” (the common topics) build on “Machine Learning” (distinctive topic 1) and “Database” (distinctive topic 2). Two important observations can be made based on the results shown in Table 4.4.

• Distinctive Topic. Both the discNMF and CDTM models do well in identifying the distinctive topics “Machine Learning” and “Database”. However, the results for CDTM are computed through word groups, while discNMF is more dependent on one or two of the most representative words. 1) Both methods identify the important words correctly. For example, these algorithms were able to find the most representative words in “Database”, such as “database”, “xml”, “query”, and “sql”. Also, most of the important topic words are very similar in discNMF and CDTM. For example, 7 out of the 10 top-ranked words in “Machine Learning” are shared by the two methods. 2) The main difference between discNMF and

Table 4.4: Word distributions for topics (10 most likely words) learned by the discNMF model and proposed CDTM model from the 4 area dataset.

Distinctive Topic: Machine Learning
discNMF                       CDTM
learning        0.1206531     learning        0.019297
based           0.0459054     based           0.015685
model           0.0386878     using           0.014255
using           0.0337021     data            0.012807
reinforcement   0.0178361     model           0.012477
algorithm       0.0111599     algorithm       0.007233
classification  0.0111518     search          0.007105
approach        0.0104132     information     0.006738
network         0.0103787     clustering      0.006738
data            0.0088866     classification  0.006720

Distinctive Topic: Database
discNMF                       CDTM
database        0.3846852     data            0.0280814
xml             0.1141720     query           0.0188698
processing      0.0252830     web             0.0171929
management      0.0220824     sql             0.0143184
querying        0.0157286     database        0.0121842
keyword         0.0135791     mining          0.0114003
sql             0.0103898     using           0.0107687
design          0.0094311     xml             0.0099194
caching         0.0089737     system          0.0094839
server          0.0088255     efficient       0.0088088

Common Topic: Information Retrieval
discNMF                       CDTM
data            0.0001323     query           0.005830
based           0.0001215     xml             0.004480
query           0.0001214     web             0.004357
web             0.0001213     data            0.004112
using           0.0001212     system          0.004112
mining          0.0001211     clustering      0.003744
system          0.0001206     evaluation      0.003621
search          0.0001205     mining          0.002884
efficient       0.0001203     summary         0.002639
clustering      0.0001202     search          0.002394

Common Topic: Data Mining
discNMF                       CDTM
model           0.0035363     web             0.0074333
game            0.0018993     data            0.0055904
robot           0.0016786     mining          0.0042388
planning        0.0016436     query           0.0042388
agent           0.0015628     retrieval       0.0035017
logic           0.0015556     based           0.0033788
process         0.0009576     probabilistic   0.0023959
kernel          0.0009363     efficient       0.0023959
human           0.0008836     pattern         0.0023959
markov          0.0008294     information     0.0022730

CDTM is that they assign word weights within each topic differently. Compared to the CDTM model, the discNMF model seems to be more "biased" towards the most important words, with word weights dropping dramatically from top to bottom. For instance, in discNMF, the weights of the top ranked words ("learning" for the topic "Machine Learning", "database" for the topic "Database") are 3 times greater than the weights awarded to the second-ranked words ("based" for the topic "Machine Learning", "xml" for the topic "Database"). This "bias" weakens the performance of discNMF, as its outputs are in fact decided by a relatively small number of words, rather than the word groups used by the CDTM model.

• Common Topic. Compared to the results for the distinctive topics, both discNMF and CDTM produce less obvious results for common topics. Although this task is so difficult that even well-trained human analysts find it hard to tell a data mining paper from an information retrieval paper, we can still find interesting differences in the behaviors of the discNMF and CDTM models when dealing with such tasks. 1) DiscNMF degenerates to "random guess", while CDTM continues to give a comparatively stable performance. For discNMF, the weights of the top ranked words in common topics are far smaller than those of distinctive topics. For instance, the weight of the top ranked word "data" for the topic "Information Retrieval" is only 0.0001323, more than 1000 times smaller than the top ranked word "learning" (0.1206531) for the distinctive topic "Machine Learning" and the top-1 word "database" (0.3846852) in the distinctive topic "Database". Since the vocabulary size is 8,841, the weight given to the word "data", 0.0001323, is only slightly larger than 0.0001131 (1/8,841), which suggests a "random guess" in which the weights are evenly assigned among all words. The CDTM model behaves more stably, with the weights of the top ranked words in the common topics being around 1/3 of those in the distinctive topics. 2) Once more, the discNMF method tends to find the most representative words, while CDTM considers the combined factors of the word group. DiscNMF still tries to find the most significant words, such as "robot", "kernel", and "markov" in the common topic "Data Mining". It is true that words such as "markov" are used more frequently in data mining papers than in information retrieval articles, but they appear in only a relatively small number of papers, and may therefore fail to tell whether a paper belongs to "Data Mining" or "Information Retrieval" in most cases.
Similar to the case of distinctive topics, the common topics found by the CDTM model are also computed according to the combined factors from a group of words. First, the words found by the CDTM model tend to be more general than those from discNMF. For example, the topic "Data Mining" contains the exclusive words "probabilistic" and "pattern", which are much more widely used in data mining papers than the words "markov" and "kernel". Second, there are some overlaps among the top ranked words (3 out of 10) in both "Data Mining" and "Information Retrieval", and their different weights reflect the real case properly. For example, in the topic "Information Retrieval", the word "query" is the most important word. Although this word also appears in the top ranked word list for "Data Mining", it has a much smaller weight there. This phenomenon reflects the fact that both "Data Mining" and "Information Retrieval" are intersections of "Machine Learning"

and “Database”, with different emphasis on similar content.
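The "random guess" diagnostic used above can be sketched in a few lines of Python: rank a topic's words by weight and compare the top weight against the uniform weight 1/V. The weight values below are copied from Table 4.4; the threshold factor of 2 is an illustrative assumption, not a criterion from the text.

```python
# Sketch: rank topic words and flag near-uniform ("random guess") weights.
# The weights below are illustrative values taken from Table 4.4.

def top_words(topic_weights, n=10):
    """Return the n highest-weighted (word, weight) pairs of a topic."""
    return sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)[:n]

def looks_like_random_guess(topic_weights, vocab_size, tolerance=2.0):
    """A topic degenerates toward 'random guess' when its top weight is
    within `tolerance` times the uniform weight 1 / vocab_size."""
    top_weight = max(topic_weights.values())
    return top_weight <= tolerance / vocab_size

machine_learning = {"learning": 0.1206531, "based": 0.0459054, "model": 0.0386878}
info_retrieval = {"data": 0.0001323, "based": 0.0001215, "query": 0.0001214}

print(top_words(machine_learning, n=2))
print(looks_like_random_guess(machine_learning, vocab_size=8841))  # False
print(looks_like_random_guess(info_retrieval, vocab_size=8841))    # True
```

With the vocabulary size of 8,841 used in this experiment, the discNMF common-topic weight 0.0001323 falls within twice the uniform weight 1/8,841, while the distinctive-topic weight 0.1206531 does not.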

4.5.5 Topic Discovery on Multiple Collections

As mentioned earlier, the proposed CDTM model is capable of handling the case of multiple datasets. This is another advantage the CDTM model enjoys over NMF-based approaches, which can only be extended from two datasets to multiple sets with great difficulty. In this section, we discuss the interesting discoveries that can be made by applying the CDTM model to "shooting" datasets consisting of multiple document collections. This case illustrates how the CDTM model can be applied to conduct "comparative thinking" and discover interesting patterns in real-world data. The "shooting" datasets consist of three subsets corresponding to three recent shooting events that occurred in the United States: "California teenager shooting", "South Carolina Church shooting", and "Cincinnati shooting". The goal here is to identify the distinctive topics contained in each document set, and the common topics shared by all three document sets. Besides these three events, we also include some documents from other "shooting" events to generate some noise. For the parameter setting, we set both the common topic number Kc and the distinctive topic number Kd to one. The resulting topic distributions are illustrated as word clouds in Figure 4.5.

• Distinctive Topic. As the word clouds show, the top ranked words for each event reveal its characteristics well. For example, in "California teenager shooting", a teenager named "Rodger" (the biggest word) killed several victims, most of whom were "women". Location terms are the most obvious distinctive features for each event, such as "cameo" for "Cincinnati shooting" and "Charleston" for "South Carolina Church shooting". One interesting observation is that the top ranked word list includes the word "YesAllWomen", which is in fact a hot hashtag on Twitter. Since all our documents are news articles, this phenomenon indicates the significant influence of newly emerging social media on traditional news media. A similar conclusion can be drawn from the "South Carolina Church shooting" event, where the word "charlestonshooting" is also a hashtag from Twitter data.

• Common Topics. The common topics shared by these three events (together with the other noisy events we included) reflect the words appearing most frequently in the "shooting" events. As can be seen from the central word cloud denoting the common topic, the two largest words, "shooting" and "victims", are the most representative terms for all three shooting events. However, the other top ranked words also provide meaningful insights into these events. For example, the words "black", "sex", "girls", and "college" are among the most important words in the list, which is consistent with the fact that many of the shootings were carried out by young college students and often involved complex issues of gender or racial discrimination.

4.6 Conclusion

In this paper, we proposed a novel probabilistic model, CDTM, to identify the common and distinctive topics among multiple data sets. Our new CDTM model extends latent variable probabilistic methods (e.g., LDA) by allowing the modeling of documents through choices between specific aspects of one data set and common aspects shared by all collections. Extensive experiments reveal that the proposed method is indeed capable of identifying clear common and distinct topics for multiple data sets, thus providing meaningful insights into massive data. A comparison with existing state-of-the-art models indicates that CDTM is more accurate than other LDA variations, and more stable than the NMF-based approaches. In our future work, we plan to improve the efficiency of the proposed method so that it can be used on real-time data streams. In addition, we plan to build a visual analytics system capable of interactively visualizing the common and distinctive topics.

Figure 4.4: Performance comparison in terms of perplexity. (a) The perplexity performance vs. distinctive topic number Kd for discLDA, discNMF, and CDTM; the total topic number K = Kc + Kd is set to 6. (b) The perplexity performance of the CDTM model with different distinctive topic numbers Kd (1 to 5), from iteration 1 to iteration 40; the total topic number K = Kc + Kd is set to 6.

Figure 4.5: Case study of gun shooting in the United States. The word clouds show one distinctive topic per event and the common topic shared by all events.

Chapter 5

Social Media based Simulation Models for Understanding Disease Dynamics

5.1 Introduction

Since 1976, seasonal flu has killed up to 500,000 people every year, according to the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO) 1. Flu is not only "deadly" but also "expensive". For example, in the United States it causes economic losses of up to $87 billion annually. Furthermore, with the help of modern transportation, such diseases spread much faster and hit larger populations today. In March 2009, swine flu first occurred in Mexico and California, and soon reached all over the world as a result of airline travel [26]. How to efficiently monitor and track the dynamics of ongoing epidemic diseases is one of the most crucial challenges in the field of public health. Currently, two related research branches have been working on this challenge, namely, social media mining and computational epidemiology. Traditional computational epidemiology models usually utilize social contact networks to simulate the flu spreading process. With a detailed representation of the underlying social contact networks, these systems are able to "duplicate" real-world disease progression in the virtual world. However, they are highly dependent on surveillance data provided by the CDC to estimate parameters, which results in the following two limitations 2. 1) Low timeliness. CDC surveillance data are updated once per week, with at least one week of delay behind real-time disease transmission. Such outdated data can hardly achieve good performance in monitoring rapidly spreading epidemics. 2) Insufficient accuracy. CDC provides surveillance data at the state level, with little detailed information for subregions such as counties. The granularity of these data is too coarse to tune accurate parameters for model estimation. On the other hand, newly emerging social media mining techniques can collect real-time disease

1http://www.who.int/mediacentre/factsheets/fs211/en/ 2http://www.cdc.gov/flu/weekly/fluviewinteractive.htm


Figure 5.1: SMS model consists of "social media space" and "simulation space". Both of them can be considered as subsets of the real world. (The figure pairs simulation-space states such as infected, recovered, vaccinated, and isolated with example tweets, e.g., "Just got my flu shot", "get the flu, in bed for 3 days", and "I felt so tired. I am sick?")

data through the posts of online users. As can be seen from Figure 5.1, many social media users publish messages that reflect their health status. Furthermore, besides plain text, these messages often contain time and location information. Such multi-aspect data not only provide new opportunities for identifying individual cases, but also benefit population-level analysis such as spatiotemporal forecasting of influenza outbreaks. However, most social media mining techniques are purely data-driven methods, and do not have a clear understanding of the underlying social contact network in disease diffusion. As later demonstrated in Section 5.5, social media mining methods are "shortsighted" in nature. They are good at real-time detection and short-term prediction, since they can utilize the most up-to-date social media data. However, they perform poorly in long-term disease forecasting, because they ignore the inherent features of the disease and therefore fail to model its spreading process. As discussed above, computational epidemiology models can capture the diffusion patterns of disease spread through detailed simulation of the real world, but their "intelligence" has not been fully developed due to the limitations of CDC data in timeliness and accuracy. On the contrary, social media mining methods can utilize the most up-to-date user-provided data, but lack global knowledge in disease modeling. In this paper, we propose a novel Social Media based Simulation (SMS) model which combines the advantages of computational epidemiology approaches and social media mining techniques in one unified framework. Specifically, as shown in Figure 5.1, the proposed SMS model considers online posts from users in the social media space as well as the underlying social contact network in the computational simulation space. In the social media space, the SMS model infers users' health status through their posts.
First, the SMS model is able to identify infected users through tweets such as "4th day with flu". Second, the model is also capable of identifying potential patients in their incubation period through tweets such as "I felt so tired. I am sick?" These individual posts are then analyzed and aggregated into population-level parameters for the simulation space. Based on the detailed social contact network, the disease propagation process is optimized in the simulation. After that, the outputs of the computational part are fed into the social media space as prior knowledge for learning in the next iteration. This iterative feedback mechanism benefits the learning in both spaces, and therefore tackles the challenges that previous social media mining methods and computational epidemiology models cannot deal with. The major contributions of this paper are summarized as follows.

• A unified framework that jointly models social media mining and epidemiology simulation is proposed. The proposed SMS model collects and analyzes the most up-to-date data from social media while, at the same time, being capable of inferring the underlying propagation process like a standard computational model. • A "dual-space" learning model is developed for mining disease diffusion patterns. Our SMS model consists of two spaces: the social media space and the simulation space. Different methodologies are adopted in the different spaces for optimal performance. Meanwhile, information is shared efficiently across the spaces with carefully designed learning strategies. • A novel learning algorithm consisting of multiple inference technologies is developed.

A variety of learning approaches are incorporated into the SMS model, including Gibbs sampling, maximum likelihood estimation, and numerical optimization. • Extensive experiments have been conducted to demonstrate the effectiveness of the proposed SMS model. The SMS model is tested on large-scale datasets against four existing state-of-the-art algorithms. With extensive quantitative and qualitative experimental results, the SMS model shows significant improvement over both social media mining methods and computational epidemiology models.

The rest of the paper is organized as follows. Section 5.2 reviews related work. Section 5.3 introduces the proposed SMS model. Section 5.4 discusses the inference of the SMS model. Section 5.5 presents the experimental performance analysis and provides quantitative comparisons with various other methods available in the literature. The paper concludes in Section 5.6.

5.2 Related Work

This section reviews research literature related to our work. The first branch is computational approaches that have long been studied in the area of epidemiology modeling [46, 113, 28, 23]. Another direction includes simulation-based methods that represent the population as social contact networks to study the spread of diseases within the network, using individual-based simulations [4, 3, 8]. More recently, emerging social media platforms such as Twitter and Facebook have prompted the development of epidemic knowledge mining methods based on analysis of social media data. This field can be further divided into two sub-branches: volume analysis on aggregate-level statistics [2, 31, 22] and semantic analysis for detailed health information [83, 19, 13]. Population-based Models: Traditional population-based computational models for epidemiology usually divide the targeted population into compartments according to health status and demographics, and then utilize differential equations to model the spread of disease [46, 113]. Besides, individual-based computational models are designed for network epidemiology, where the epidemiology process is modeled as stochastic propagation over the contact network between individuals [28, 23]. Random graph models are widely used by such "ad-hoc" models. Individual characteristics such as age, sex, personal relationships, and locations are first translated into parameters for the random graph models. These models are then fit to real-world data to obtain the best parameter values. Simulation-based Models: Simulation-based models usually build a synthetic population that simulates the features of the real population. Specifically, each person in such a system is assigned geographical, social, behavioral, and demographic attributes (e.g., age and income) [8]. The social contact network is simulated by assigning daily activities and locations to each node (person) in the network [8, 3].
The epidemic dynamics are then modeled as diffusion processes across the network, which enables the computation of infection time and location for all individuals.

Social Media Mining: Social media users may report their symptoms through online posts, which are known to be among the best signals for early disease detection, even before diagnosis [49]. Several attempts have been made to track disease outbreaks by studying the relationship between the aggregate volume of flu-related social media posts and CDC data [2, 31, 22]. These methods usually first identify flu-related tweets by keyword selection and then try different regression models to correlate the tweet volume and CDC statistics [31, 22]. Other methods mainly focus on analyzing the semantics of tweets to reveal their relevance to epidemic topics such as public health [83], health behavior [19], and disease spread [13]. Paul et al. [83] proposed a Bayesian network model to distinguish ailment topics from general topics. Brennan et al. [13] utilized interpersonal interactions to predict disease spread between cities. To estimate overall trends, Chen et al. [19] proposed a topic model to capture users' hidden states from tweets and aggregate individual states into geographical region states. In summary, traditional population-based approaches and simulation-based models focus on capturing the characteristics of flu diffusion, while social media mining methods emphasize discovering the latent patterns inside the data. Different from all these works, our proposed SMS model is a hybrid model that simultaneously serves as a text mining model on social media data and a computational model aware of the underlying human contact network.

5.3 The Proposed SMS Model

With inputs from social media data, this paper aims at capturing the diffusion patterns of epidemics across a contact network. Specifically, over the period T = {0, ..., t, ..., T}, the overall objective can be formally defined as: estimate the health states SV,t at each time stamp t for the population V in the region of interest, using the social media data streams U as inputs. To achieve this goal, our proposed SMS model integrates two spaces (the simulation space and the social media space) within one framework, as shown in Figure 5.2. In this section, we first introduce the independent learning process within each space, and then present the information sharing mechanism between the two spaces. The notations used in this paper are summarized in Table 5.1.

5.3.1 Learning in Social Media Space

We define the social media data as D = ∪_{u∈U, t∈T} D_{u,t}, where D_{u,t} is the post of user u at time t. Note that multiple posts of user u within time interval t are integrated as one document m. In the well-known SEIR model, each person is assumed to be in one of the following states: susceptible (S), exposed (E), infectious (I), and recovered (R). Generally, an individual shows no symptoms in the susceptible (S) and recovered (R) states, has been infected but is not yet infectious in the exposed (E) state, and suffers from severe symptoms in the infectious (I) state. Social media users will not post content related to disease in states S and R, since no symptoms are shown in these two states. Therefore, our work assumes a user can be in one of the following three health states: healthy

Figure 5.2: Overall Framework of the SMS model.

Table 5.1: Mathematical Notation

u : user in social media space
v : person (node) in simulation space
t : time stamp
U : social media population of targeted region
V : simulation population of targeted region
T : set of all time stamps
S_{u,t} : health status of user u at time t
h_{u,t} : user u's healthy status indicator at time t
e_{u,t} : user u's exposed status indicator at time t
i_{u,t} : user u's infected status indicator at time t
G : contact network
E : edges in the contact network
W : weights for edges in the contact network
D : social media data streams
τ : transmission probability per unit contact time
p_I : infectious period
p_E : incubation period
s : choice of label for words
z : topic index
w : words in document
ρ_{t,e} : exposed population at time t in simulation space
ρ_{t,i} : infected population at time t in simulation space
σ : control parameter
µ : document distribution of label assignments
θ_s : document topic distribution under label s
φ_{s,z} : mixture component of words in topic z with label s
α : Dirichlet parameter for document-topic mixture
β : Dirichlet parameter for word-topic mixture
γ : Dirichlet parameter for label mixture
ω_e : Dirichlet parameter for incubation period
ω_i : Dirichlet parameter for infectious period

(S and R of SEIR), exposed (E), and infectious (I). As shown in Figure 5.2 and Algorithm 4, the SMS model learns health status from social media data through a specially designed Bayesian graphical model. The generative process of words in our model for social media posts consists of three stages. First, the health status s is chosen from a per-document multinomial distribution with prior µ: s = 0 indicates the user of the corresponding post is healthy (S and R of SEIR); s = 1 implies the user is exposed but has not been confirmed as infected (E), for example, "I feel so tired all day even with 9 hours sleep"; and s = 2 denotes the user has been infected (I), with words such as "get the flu, in bed." Second, after choosing the value for s, topic z is drawn from the K-dimensional topic mixture θs. Different from other topic models, each document here is associated with S topic distributions. This scheme enables the prediction of the health status based on the extracted topics. Finally, a word is generated from the word distribution φs,z, conditioned on both topic z and health status s.

ALGORITHM 4: Generative process of words in the social media space of the SMS model.
for each label s = 1, 2, ..., S do
    for each topic z = 1, 2, ..., K do
        Draw φ_{s,z} ∼ Dir(β)
for each time stamp t = 1, 2, ..., T do
    for each document D_{u,t}, u = 1, 2, ..., U do
        Draw µ_{u,t} ∼ Dir(γ)
        for each label s = 1, 2, ..., S do
            Draw θ_{u,t,s} ∼ Dir(α)
        for each word w in document D_{u,t} do
            Draw s ∼ Multi(µ_{u,t})
            Draw z ∼ Multi(θ_{u,t,s})
            Draw w ∼ Multi(φ_{s,z})
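Algorithm 4 can be sketched as a runnable simulation; the dimensions (S labels, K topics, V vocabulary words) and hyperparameter values below are illustrative placeholders, not settings from the experiments.

```python
# Runnable sketch of the three-stage generative process with NumPy.
import numpy as np

rng = np.random.default_rng(0)
S, K, V = 3, 4, 20          # labels, topics, vocabulary size (illustrative)
alpha, beta, gamma = 0.1, 0.01, 0.5

# Corpus-level word distributions phi[s, z] ~ Dir(beta)
phi = rng.dirichlet([beta] * V, size=(S, K))

def generate_document(n_words):
    """Generate one document D_{u,t} following the three stages:
    draw label s, then topic z under s, then word w from phi[s, z]."""
    mu = rng.dirichlet([gamma] * S)               # per-document label mixture
    theta = rng.dirichlet([alpha] * K, size=S)    # one topic mixture per label
    words = []
    for _ in range(n_words):
        s = rng.choice(S, p=mu)                   # health-status label
        z = rng.choice(K, p=theta[s])             # topic under that label
        w = rng.choice(V, p=phi[s, z])            # word from phi[s, z]
        words.append((s, z, w))
    return words

doc = generate_document(50)
print(len(doc))  # 50 (label, topic, word) triples
```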

Besides, we define a multinomial variable S_{u,t} = (h_{u,t}, e_{u,t}, i_{u,t}) to denote the health status of each user u at time t in the social media space. Only one of the three elements in this vector equals 1, and the remaining elements equal 0. Specifically, h_{u,t} = 1 indicates the user u is healthy, e_{u,t} = 1 denotes the user u is exposed to the disease, and i_{u,t} = 1 means the user u has become infectious. S_{u,t} can be viewed as a "summary" of the variable s: the variable s indicates the status of each word, while S_{u,t} indicates the health status of each user. Therefore, the values of the elements in S_{u,t} can be computed through the posterior distribution µ_{u,t}: the s-th (s = 0, 1, 2) element of S_{u,t} is 1 if and only if argmax(µ_{u,t}) equals s.
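The argmax rule above amounts to a one-hot encoding of the posterior label mixture; a minimal sketch:

```python
# Sketch: turn the posterior label mixture mu_{u,t} into the one-hot
# health-status vector S_{u,t} = (h, e, i) via argmax.
import numpy as np

def health_status(mu):
    """mu: posterior label mixture; index 0 = healthy, 1 = exposed,
    2 = infectious. Returns the one-hot tuple (h_{u,t}, e_{u,t}, i_{u,t})."""
    s = int(np.argmax(mu))
    vec = np.zeros(3, dtype=int)
    vec[s] = 1
    return tuple(vec)

print(health_status([0.2, 0.7, 0.1]))   # (0, 1, 0): user flagged as exposed
```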

5.3.2 Learning in Simulation Space

The simulation space is a contact network G = (V, E, W), where V is the targeted population, E is the edge set, and W are the weights for the edges. Specifically, node v1 ∈ V in the network denotes an individual, who has contact with another individual v2 through edge (v1, v2) ∈ E, with contact duration equal to w(v1, v2). Under the contact network G, person v2 can be infected by person v1 with probability p(w(v1, v2), τ), where τ is the transmission probability per unit contact time. Similar to the health status of social media users, we assume each person v in the simulation world is associated with three statuses: healthy (S and R), exposed (E), and infectious (I). The incubation period p_E(v) and infectious period p_I(v) denote the duration of the exposed status and infectious status for person v, respectively. To minimize the inconsistency between the social media space and the simulation space, the hidden health states calculated by the simulation should be consistent with those from social media. Although it is impossible to map each person v in the simulation space to a specific user u in the social media space, linking the two spaces at the population level is practical and sufficient for our task. Specifically, we compare the social media users with the simulated persons within the same region (e.g., counties or states), which is formalized by the following loss function:

L = min_τ Σ_{t=1}^{T} || Σ_{v=1}^{V} I_{v,t}(G, p_E, p_I, τ) − Σ_{u=1}^{U} I_{u,t} ||²
  + Σ_{t=1}^{T} || Σ_{v=1}^{V} E_{v,t}(G, p_E, p_I, τ) − Σ_{u=1}^{U} E_{u,t} ||².    (5.1)

Here I_{v,t}(G, p_E, p_I, τ) is the overall infectious state of the simulation results at time t, and E_{v,t}(G, p_E, p_I, τ) is the corresponding incubation state. The transmission probability τ is the parameter to be optimized to achieve the best performance.
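A minimal sketch of the loss in Equation (5.1), together with one possible form of the per-edge infection probability p(w(v1, v2), τ). The text does not specify p; the independent-trials form 1 − (1 − τ)^w used below is a common network-epidemiology assumption, not the dissertation's stated choice. All inputs are toy arrays.

```python
# Sketch of Eq. (5.1) plus an assumed per-edge infection probability.
import numpy as np

def infection_probability(w, tau):
    """Assumed form: each unit of contact time is an independent
    transmission trial, so p = 1 - (1 - tau)^w over duration w."""
    return 1.0 - (1.0 - tau) ** w

def loss(I_sim, I_sm, E_sim, E_sm):
    """Eq. (5.1): squared distance between simulated and social-media
    population counts of infectious (I) and exposed (E) users per day."""
    I_sim, I_sm, E_sim, E_sm = map(np.asarray, (I_sim, I_sm, E_sim, E_sm))
    return float(((I_sim - I_sm) ** 2).sum() + ((E_sim - E_sm) ** 2).sum())

print(loss([10, 20], [12, 18], [5, 7], [5, 9]))           # 12.0
print(round(infection_probability(w=8, tau=0.05), 4))     # 0.3366
```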

5.3.3 Interaction between two spaces

We will now introduce the procedure for generating the parameters required by the simulation space from the social media space, and then discuss the mechanism for using the simulation outputs to estimate the priors needed by the social media learning model. The key to transferring information from the social media space to the simulation space is to find a way to aggregate individual-level social media posteriors into population-level parameters. In Equation (5.1), p_E and p_I are input parameters required by the simulation space. The specific incubation period p_E(v) and infectious period p_I(v) for each individual v can be viewed as observations from the multinomial distributions Multi(p_E) and Multi(p_I). As mentioned above, although it is unrealistic to link each user u in the social media space to each individual v in the simulation space, estimation at the population level is sufficient for our task. The maximum likelihood solution for p_E at value t is thus calculated as the fraction of social media users with that incubation period, n_E^t / |U|, where n_E^t

denotes the number of users whose incubation period equals t days. The estimation of parameter p_I can be calculated in a similar manner. Conversely, the simulation outputs can also be used to improve the learning performance in the social media space. On the one hand, in the social media space, the ideal values for the Dirichlet prior γ of health status s should reflect the health status of the population. On the other hand, the simulation outputs include the health status of the population. Specifically, two transition parameters, the incubation rate ρ_{t,e} and the infectious rate ρ_{t,i}, are defined to denote the ratio of exposed and infectious persons among the entire population, respectively. These values are calculated as shown in Equations (5.2) and (5.3):

ρ_{t,e} = Σ_{v=1}^{V} E_{v,t}(G, p_E, p_I, τ) / V,    (5.2)

ρ_{t,i} = Σ_{v=1}^{V} I_{v,t}(G, p_E, p_I, τ) / V,    (5.3)

where E_{v,t}(G, p_E, p_I, τ) and I_{v,t}(G, p_E, p_I, τ) are outputs from the simulation space, as mentioned in Equation (5.1). The Gamma prior for the Dirichlet parameter of health status s (s can be e or i) at epoch t is therefore computed as follows:

γ_{t,s} ∼ Gamma(σρ_{t,s}, σ),    (5.4)

where the mean is proportional to the simulation output ρ_{t,s}, while the parameter σ controls the consistency of the prior.
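The aggregation steps above can be sketched as follows: the maximum likelihood estimate of the incubation-period distribution is a normalized histogram of user-level periods, and the Dirichlet prior is drawn from Gamma(σρ_{t,s}, σ). Note that NumPy's gamma sampler is parameterized by shape and scale, so the rate σ becomes scale 1/σ; all numeric values below are toy inputs.

```python
# Sketch: MLE of p_E as a histogram, and a draw of gamma_{t,s} per Eq. (5.4).
import numpy as np
from collections import Counter

def mle_period_distribution(user_periods, max_days):
    """p_E[t] = n_E^t / |U|: fraction of users whose incubation period
    equals t days, for t = 1..max_days."""
    counts = Counter(user_periods)
    n_users = len(user_periods)
    return [counts.get(t, 0) / n_users for t in range(1, max_days + 1)]

p_E = mle_period_distribution([2, 2, 3, 1, 2], max_days=4)
print(p_E)  # [0.2, 0.6, 0.2, 0.0]

rng = np.random.default_rng(0)
rho_te = 0.03        # exposed ratio from simulation output (toy value)
sigma = 100.0        # consistency control parameter (toy value)
# Gamma(shape = sigma * rho, rate = sigma) -> scale = 1 / sigma in NumPy.
gamma_te = rng.gamma(shape=sigma * rho_te, scale=1.0 / sigma)
print(gamma_te > 0.0)  # True: a valid positive Dirichlet parameter
```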

5.4 Model Inference

Although exact inference of the posterior distributions of the hidden variables in the SMS model is generally intractable, the solution can be estimated through approximate inference algorithms, such as variational expectation maximization [10, 33, 32], Gibbs sampling [27, 88, 15], maximum likelihood estimation [20, 11], and numerical optimization [90, 120]. First, Gibbs sampling is used for the inference of the proposed text mining model in the social media space, as this approach can yield more accurate estimates than variational inference in LDA-like graphical models. Second, maximum likelihood estimation (MLE) is adopted to estimate the incubation period p_E and the infectious period p_I. Finally, the operations in the simulation space are optimized through the Nelder-Mead method [55, 79]. Using Algorithm 4 and the graphical model in Figure 5.2, the joint distribution of the SMS model in the social media space can be represented as Equation (5.5):

P(w, z, s | α, γ, β) = ∏_{m=1}^{M} ∏_{n=1}^{N} p(w_{mn} | s_{mn}, z_{mn}) ∏_{m=1}^{M} ∏_{n=1}^{N} p(z_{mn} | θ_m^{s_{mn}}) ∏_{m=1}^{M} ∏_{n=1}^{N} p(s_{mn} | µ_m) ∏_{m=1}^{M} p(µ_m | γ) ∏_{m=1}^{M} ∏_{s=1}^{S} p(θ_m^s | α) · p(γ | ϕ, σ).    (5.5)

The key to this inference problem is to estimate the posterior distributions of the following hidden variables: (1) the topic assignment indicator z_{mn} for words; (2) the label assignment indicator s_{mn} for words; (3) the topic mixture proportion θ_{msz} and label mixture proportion µ_{ms}. Using Equation (5.4), the last term p(γ | ϕ, σ) of Equation (5.5) can be written as:

p(γ | ϕ, σ) = ∏_s σ^{σϕ_s} γ_s^{σϕ_s − 1} exp(−σγ_s) / Γ(σϕ_s),    (5.6)

where Γ(·) is the gamma function. From the joint distribution, the full conditional distribution for a word term i = (m, n) can be derived, where i denotes word n in document m. As a special case of Markov chain Monte Carlo, Gibbs sampling iteratively samples one instance at a time, conditional on the values of the remaining variables. We only present the result here; the detailed derivation process is omitted due to space limitations.

p(z_{mn} = k | w, z_{¬i}, s) ∝ (n_{sz,¬i}^v + β) / Σ_{v=1}^{V} (n_{sz,¬i}^v + β) · (n_{ms,¬i}^z + α) / Σ_{z=1}^{K} (n_{ms,¬i}^z + α)    (5.7)

In the above equation, V is the size of the vocabulary, K is the number of topics, and n_{sz,¬i}^v is the number of times topic z and label s are assigned to term v over the whole data set, excluding the current instance i and its topic assignment. n_{ms,¬i}^z is the number of words assigned label s and topic z in document m, excluding the current instance i.
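One collapsed Gibbs update for a word's topic assignment, following the form of Equation (5.7), might look like the sketch below; the count arrays are toy data, and the counts are assumed to already exclude the current instance i (the ¬i convention).

```python
# Sketch of one collapsed Gibbs update for z (Eq. 5.7). Shapes and
# hyperparameters are illustrative; counts already exclude instance i.
import numpy as np

def sample_topic(rng, n_szv, n_msz, s, v, m, alpha, beta):
    """Sample a topic for word v with label s in document m.
    n_szv: label x topic x word counts; n_msz: doc x label x topic counts."""
    V = n_szv.shape[2]
    K = n_msz.shape[2]
    left = (n_szv[s, :, v] + beta) / (n_szv[s].sum(axis=1) + V * beta)
    right = (n_msz[m, s, :] + alpha) / (n_msz[m, s].sum() + K * alpha)
    p = left * right
    return rng.choice(K, p=p / p.sum())   # normalize the proportionality

rng = np.random.default_rng(0)
n_szv = rng.integers(0, 5, size=(3, 4, 20)).astype(float)  # S x K x V
n_msz = rng.integers(0, 5, size=(2, 3, 4)).astype(float)   # M x S x K
z_new = sample_topic(rng, n_szv, n_msz, s=1, v=7, m=0, alpha=0.1, beta=0.01)
print(0 <= z_new < 4)  # True: a valid topic index
```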

p(s_{mn} = s | w, z, s_{¬i}) ∝ (n_{s¬i,z}^v + β) / Σ_{v=1}^{V} (n_{s¬i,z}^v + β) · (n_{ms¬i}^z + α) / Σ_{z=1}^{K} (n_{ms¬i}^z + α) · (n_m^{s,¬i} + γ).    (5.8)

Similar to the inference of z, n_{s¬i,z}^v is the number of times topic z and label s are assigned to term v over the whole data set, excluding the current instance i and its label assignment; n_{ms¬i}^z is the number of words choosing label s and topic z in document m, excluding instance i; and n_m^{s,¬i} is the number of words (excluding instance i) choosing label s in document m.

Parameters $\Phi_{szv}$, $\theta_{msz}$, and $\mu_{ms}$ are multinomial distributions with Dirichlet priors. According to Bayes' rule and the definition of the Dirichlet prior, these multinomial parameters can be computed from the above posteriors:

$$
\Phi_{szv} = \frac{n_{sz}^{v}+\beta}{\sum_{v=1}^{V}\left(n_{sz}^{v}+\beta\right)},
\tag{5.9}
$$

$$
\theta_{msz} = \frac{n_{ms}^{z}+\alpha}{\sum_{z=1}^{K}\left(n_{ms}^{z}+\alpha\right)},
\tag{5.10}
$$

$$
\mu_{ms} = \frac{n_{m}^{s}+\gamma}{\sum_{s=1}^{S}\left(n_{m}^{s}+\gamma\right)}.
\tag{5.11}
$$
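The collapsed Gibbs update of Equation (5.7) and the parameter recovery of Equation (5.9) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the thesis implementation; the count-array names (`n_szv`, `n_msz`) are hypothetical.

```python
import numpy as np

def sample_topic(n_szv, n_msz, s, m, v, alpha, beta):
    """Collapsed Gibbs update for one word, following Eq. (5.7).

    n_szv[s, z, v]: count of term v assigned to label s and topic z (corpus-wide).
    n_msz[m, s, z]: count of words in document m with label s and topic z.
    The current word is assumed to have already been decremented from both counts.
    """
    K, V = n_szv.shape[1], n_szv.shape[2]
    left = (n_szv[s, :, v] + beta) / (n_szv[s].sum(axis=1) + V * beta)
    right = (n_msz[m, s] + alpha) / (n_msz[m, s].sum() + K * alpha)
    p = left * right
    p /= p.sum()                      # normalize the unnormalized conditional
    return np.random.choice(K, p=p)   # draw the new topic assignment

def recover_phi(n_szv, beta):
    """Posterior mean of the label-topic-word multinomials, following Eq. (5.9)."""
    V = n_szv.shape[2]
    return (n_szv + beta) / (n_szv.sum(axis=2, keepdims=True) + V * beta)
```

In practice the sampler sweeps over all words, decrementing and re-incrementing the counts around each draw; the parameters in Equations (5.9)-(5.11) are then read off the final counts.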

The optimal value of the transmission rate $\tau$ is found using the Nelder-Mead optimization method, since minimizing the loss function $L$ in Equation (5.1) with respect to $\tau$ is a non-convex and non-differentiable problem. The SMS model is a semi-supervised learning approach. In the training process, the SMS model is fed with labeled tweets (health states). The trained model $\mathcal{M}$ of the text part of the social media space contains the word distribution $\phi_s$ for each health state $s$. With the trained model $\mathcal{M}$, the SMS model can estimate the posterior distributions of the health states $\tilde{s}$ of unlabeled Twitter streams. To achieve this, we follow the approach introduced in [106] and run the inference process on the new documents exclusively. Inference for this testing process corresponds to Equations (5.7) and (5.8), with the difference that the Gibbs sampler is run with $\phi_s$ fixed. In the initial stage, the algorithm randomly assigns switch variables to words; then a number of Gibbs sampling updates are made to estimate the posterior.

$$
p(\tilde{s}_{mn}=s\mid \tilde{w}_{mn}=v,\tilde{\mathbf{s}}_{\neg i},\tilde{\mathbf{z}},\mathcal{M})
\propto \phi_{s,v}\left(n_{m,\neg i}^{s}+\gamma\right)
\tag{5.12}
$$
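The Nelder-Mead search for $\tau$ described above can be sketched with a minimal one-dimensional simplex (the actual method [55, 79] handles arbitrary dimensions). The exponential "simulator" and its loss below are stand-ins for Equation (5.1), not the actual SMS objective.

```python
import math

def nelder_mead_1d(f, x0, step=0.1, tol=1e-8, max_iter=500):
    """Minimal 1-D Nelder-Mead: a two-point simplex with reflection,
    expansion, and contraction moves (no derivatives required)."""
    simplex = [x0, x0 + step]
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex
        if abs(worst - best) < tol:
            break
        reflected = best + (best - worst)
        if f(reflected) < f(best):
            expanded = best + 2.0 * (best - worst)
            simplex[1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(worst):
            simplex[1] = reflected
        else:
            simplex[1] = best + 0.5 * (worst - best)  # contract toward the best point
    return min(simplex, key=f)

# Stand-in loss: squared error between a toy simulated ILI curve exp(tau * k)
# and an "observed" curve generated with tau = 0.3.
observed = [math.exp(0.3 * k) for k in range(10)]
loss = lambda tau: sum((math.exp(tau * k) - o) ** 2 for k, o in enumerate(observed))

tau_opt = nelder_mead_1d(loss, x0=0.1)
```

Because the search only compares loss values, it tolerates the non-convex, non-differentiable loss surface produced by the simulation component.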

5.5 Experimental Results

In this section, we first describe the data preparation, the metrics used for evaluation, and the settings for all the comparison methods. After that, our proposed SMS model is compared with existing state-of-the-art algorithms on real-world data sets.

5.5.1 Datasets

The Twitter data used in this paper consist of two parts: a training set D1 and a testing set D2. The training set D1 was collected using the following steps:

Figure 5.3: Performance in terms of Pearson correlation in MA and MD states for 2012-2014 data.

1. Twitter stream data collection. Twitter data streams were retrieved through the REST API using flu-related keywords such as "flu", "h1n1", and "influenza". The keyword list is provided by Paul and Dredze [83].

2. Identify tweet health status. We asked human annotators to create labels for the tweets. Each annotator selected a label from the statuses "healthy", "exposed", and "infected" for each tweet. A label was confirmed only if it was chosen by at least 2 annotators.

The testing set D2, which shares the same set of users U as D1, was created as follows:

1. Extract users. The users U of the tweets in the training set D1 were extracted from the data streams.

2. Retrieve tweets. Retrieve posts belonging to the authors U that were published within two weeks before and after the time span of dataset D1.

3. Geocoding. Conduct geocoding on tweets to identify location information such as GPS tags using the Carmen geocoder1.

4. Data cleaning. Remove retweets and keep only tweets within the targeted regions.

In summary, 16,864 tweets were collected in the training dataset D1. The testing set D2 contains 19,785,147 tweets published by 15,005 users in Maryland (MD) and Massachusetts (MA) from August 2012

1https://github.com/mapbox/carmen

to July 2014, where 70% of the tweets were assigned locations. It should be noted that D1 and D2 share the same set of users.

Figure 5.4: Performance in terms of peak time in MA and MD states for 2012-2014 data.

5.5.2 Labels and Evaluation Metrics

In this paper, the ground-truth influenza data used for validation are provided by the Centers for Disease Control and Prevention (CDC), consisting of the percentage of weekly physician visits related to influenza-like illness (ILI) for most regions in the United States. Three widely used metrics are adopted for evaluating prediction performance: Pearson correlation (with its p-value), mean squared error (MSE), and peak-time error.

• Pearson correlation: The Pearson correlation is the covariance of the predicted results and the ground truth divided by the product of their standard deviations. It measures the linear relationship between variables, with values ranging from -1 to +1: a value of 1 indicates perfect positive linear correlation, -1 indicates perfect negative linear correlation, and 0 indicates that no linear correlation exists. A larger Pearson correlation implies a stronger positive linear relationship between the two variables.

• Mean squared error: MSE is the mean of the squared errors between the predicted results and the ground-truth labels. MSE is always non-negative, and a smaller MSE indicates smaller errors between the predicted results and the ground truth. In the optimal case, MSE is close to zero.

• Peak-time error: The peak-time error is the difference between the predicted peak time (the week with the largest infected population) and the actual peak time. A smaller peak-time error indicates better forecasting performance.
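The three metrics can be computed directly from the predicted and observed weekly ILI series; a minimal sketch (array contents are illustrative):

```python
import numpy as np

def pearson_corr(pred, truth):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.corrcoef(pred, truth)[0, 1]

def mse(pred, truth):
    """Mean squared error between predictions and ground truth."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.mean((pred - truth) ** 2)

def peak_time_error(pred, truth):
    """Absolute difference (in weeks) between predicted and actual peak weeks."""
    return abs(int(np.argmax(pred)) - int(np.argmax(truth)))
```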

5.5.3 Comparison Methods

The proposed SMS model is compared with four other models, including 2 social media mining methods (LinARX and LogARX) and 2 computational epidemiology models (SEIR and EpiFast).

• LinARX [1]: This method uses a standard autoregressive exogenous (ARX) model to explore the dependence between influenza-like illness (ILI) visits and the social media time series. The orders of LinARX for the Twitter time series and the CDC time series are set to 2 and 3, respectively, based on cross-validation.

• LogARX [2]: The LogARX model evolved from LinARX, adding a logit transformation to constrain the predicted ILI visit percentage to the 0-1 range. The orders of LogARX for both time series (CDC and social media) are set to 2 based on cross-validation.

• SEIR [76]: This method models epidemic dynamics with four health states: susceptible (S), exposed (E), infectious (I), and recovered (R). The volume of tweets classified as positive was fed into the LinARX model mentioned above. The orders of the LinARX model for both time series (Twitter data and CDC data) were set to 2 based on cross-validation.

• EpiFast [6]: This model simulates disease propagation over a social contact network. The Nelder-Mead method [6] is adopted to minimize the error between the predicted results and the actual ILI visit percentage.
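The SEIR dynamics referenced above can be sketched as a discrete-time compartmental update; all parameter values here are illustrative, not fitted.

```python
def seir_step(S, E, I, R, beta, pE, pI, N):
    """One discrete-time SEIR update.

    beta: transmission rate; pE: rate of leaving the exposed state
    (1 / incubation period); pI: recovery rate (1 / infectious period);
    N: total population size.
    """
    new_exposed = beta * S * I / N    # S -> E: contacts with infectious people
    new_infectious = pE * E           # E -> I: end of incubation
    new_recovered = pI * I            # I -> R: recovery
    S -= new_exposed
    E += new_exposed - new_infectious
    I += new_infectious - new_recovered
    R += new_recovered
    return S, E, I, R

# Run a short illustrative epidemic in a population of 10,000.
S, E, I, R = 9990.0, 0.0, 10.0, 0.0
for _ in range(100):
    S, E, I, R = seir_step(S, E, I, R, beta=0.4, pE=0.5, pI=0.25, N=10000.0)
```

Each update conserves the total population, which is a useful sanity check when fitting pE and pI by maximum likelihood as described in Section 5.4.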

5.5.4 Results

In this section, the models are compared on the percentage of ILI visits, with lead times varying from 1 week to 20 weeks. The results are validated in terms of the three evaluation metrics introduced above for two states (MA and MD).

Performance on Pearson correlation.

The forecasting performance in terms of Pearson correlation in Massachusetts (MA) and Maryland (MD) is reported in Figure 5.3. In general, the SMS model yields the best overall Pearson correlation. Methods based on social media mining (LinARX and LogARX) achieve better performance than the computational epidemiology methods (SEIR and EpiFast) at short lead times, but the computational models show their advantage as lead times increase. As shown in Figure 5.3, the Pearson correlations of the social media mining methods are high when the lead time is small, for example, less than 2 weeks. However, their performance decreases quickly as the lead time increases, almost reaching zero at a lead time of 20 weeks. On the contrary, although the computational epidemiology methods perform worse than social media

Average              MA                                MD
           2012      2013      2014      2012      2013      2014
LinARX   9.87E-05  6.65E-04  8.04E-05  1.61E-04  8.19E-04  4.62E-04
LogARX   6.45E-05  5.51E-04  3.10E-04  9.53E-05  5.00E-04  3.87E-04
EpiFast  9.65E-04  2.24E-03  2.09E-04  3.02E-05  5.14E-03  9.09E-04
SEIR     9.15E-05  3.73E-04  1.49E-04  1.04E-04  4.61E-04  2.68E-04
SMS      2.20E-05  2.38E-04  1.30E-04  2.51E-05  2.63E-04  1.98E-04

Variance             MA                                MD
           2012      2013      2014      2012      2013      2014
LinARX   5.11E-09  1.40E-07  5.22E-09  1.12E-08  1.61E-07  5.96E-08
LogARX   9.26E-10  5.75E-08  2.72E-08  3.22E-09  3.21E-08  3.48E-08
EpiFast  3.24E-07  8.54E-07  4.49E-09  4.94E-10  3.11E-05  2.25E-07
SEIR     4.05E-09  2.90E-09  4.18E-09  6.58E-09  2.36E-08  3.41E-08
SMS      4.76E-11  3.62E-09  3.85E-09  3.11E-11  5.65E-09  8.57E-09

Table 5.2: Performance in terms of mean squared error in MA and MD states for 2012-2014 data. The best performers are marked in bold; the corresponding second-best performers are underlined.

mining techniques at shorter lead times, they become more stable as the lead time increases. Our SMS model has initial performance comparable to the social media mining methods, and outperforms them by a large margin when the lead time exceeds 10 weeks. Generally, as shown in Figure 5.3, the SEIR model achieves the performance closest to our proposed SMS model in terms of Pearson correlation. These observations confirm that our proposed SMS model is the best performer overall, that social media methods are good at predicting the near future, and that the computational models are better for long-term forecasting. These phenomena follow naturally from the underlying natures of the different methods. Social media mining methods rely heavily on real-time data. This dependence leads to their good performance in predicting outcomes in the near future, but prevents them from achieving long-term stability. Computational epidemiology methods, on the other hand, use CDC data, which inherently have a 1-2 week time lag. They are therefore less sensitive to the current data and perform worse than the social media approaches in forecasting the near future, but they can model long-term disease spreading patterns across the contact network and thus obtain more robust overall performance. The SMS model benefits from combining the use of real-time data with a long-term progression mechanism, and therefore achieves the best performance.

Performance on mean squared error and peak-time error.

Table 5.2 shows the mean squared errors (MSE) for the five methods. Each value in the "Average" part is the mean of the MSEs over different lead times, while each value in the "Variance" part measures the variance across lead times. The SMS model is the best performer in most cases: it achieves the smallest average MSE and the smallest variance in 5 out of 6 datasets. In general, the computational epidemiology models are better than the social media mining methods; in both "Average" and "Variance", four of the six second-best performers are computational epidemiology methods (either SEIR or EpiFast). However, the computational epidemiology models are sensitive to the data: they perform well on some datasets but not on others. Taking EpiFast as an example, this method performs well on the "MD 2012" dataset, where it obtains the second-best average MSE and variance, but it is the worst performer on the other datasets. Figure 5.4 displays the peak-time errors, the difference between the predicted and actual peak times. Due to space limitations, only the results for 2013 are reported here; similar patterns can be seen in the other years. Slightly differently from the other metrics, at almost all lead times the computational epidemiology models are generally better than the social media mining methods in terms of peak-time error: they achieve smaller errors for most lead times (from 5 weeks to 20 weeks) than the social media based models. This is because peak-time prediction is determined by massive numbers of data points rather than isolated moments, and therefore requires more prior knowledge than the other measurements. For very small lead times, for example below five weeks, the social media mining approaches obtain performance close to that of the SMS model and the computational methods. However, for lead times greater than five weeks, the social media mining methods tend to yield increasingly large errors, while the SMS model and the computational methods remain more stable.
Similar to the results shown in Figure 5.3, our proposed SMS model is closer to the social media mining methods at short lead times (e.g., shorter than 5 weeks) and becomes closer to the computational methods as the lead time increases. The social media mining methods are purely data-driven, while our SMS model adaptively chooses the pattern that yields better results.

5.6 Conclusion

This paper described a novel framework for forecasting disease spread on large-scale social contact networks. The proposed SMS model analyzes the semantic content of social media data to infer users' health status through a Bayesian inference model, and aggregates the individual results into the population-level parameters required for simulation. The specially designed interaction scheme between the social media space and the simulation space enhances the performance of both parts. Our extensive experimental results show that the SMS model has clear advantages over both computational epidemiology models and social media mining methods. On one hand, by monitoring social media data, the SMS model can infer the most up-to-date health status of social media users and integrate it into population-level parameters. On the other hand, the SMS model maintains good long-term (more than 10 weeks) prediction performance, comparable to the computational methods, through its powerful simulation component.

Chapter 6

Automatical Storyline Generation with Help from Twitter

6.1 Introduction

Many philosophers, including Nietzsche, believe that nothing exists in isolation: all things are interrelated and interdependent. In the era of information explosion, although search engines such as Google can help users reach information about a specific event, there is still a lack of techniques that help ordinary users identify the underlying relationships between "isolated" incidents. Storyline generation is one such technology, giving people useful insights toward a better understanding of the world. Organizing massive document collections into storylines can provide users with structured summaries of given subjects, showing the evolution of relevant events. Unfortunately, detecting storylines is never an easy task. Some researchers have tried generating storylines from unorganized documents, but most of these studies were based on unsupervised clustering techniques [100, 121]. These methods can easily separate unrelated storylines (e.g., "Sports" and "Earthquake"); however, they perform poorly in distinguishing storylines with overlapping events. As shown in Figure 6.1, an "Earthquake" storyline may share many common factors with a "Terrorism" storyline. For instance, both storylines may involve aspects such as how many people died or were injured (casualties) and how to save more lives (rescue). Previous approaches often fail to differentiate such overlapping storylines since they merely connect documents based on similarity metrics, without capturing knowledge of the hidden events and topics within the storylines. Bayesian models such as LDA [10] have proved effective in learning hidden factors. Compared to clustering-based approaches, few studies have applied Bayesian networks to storyline generation, and the existing work often ignores the structure of storylines [42] or fails to model the hidden relations properly [127].
In this paper, we propose a hierarchical Bayesian model for Automatic Storyline Generation (ASG). As shown in Figure 6.1, ASG is the first storyline

model with a three-level structure: storylines are root nodes, event types lie at the second level, and the finest granularity is the topic. In the ASG model, different storylines can share common event types, and events can be viewed as various combinations of topics. For instance, both storylines shown in Figure 6.1 include the event types "Rescue" and "Casualties", and both of these event types incorporate words from Topic 2 and Topic 3. The ASG model also captures the relationships across the layers through two specially designed matrices. To date, this is the first scheme that can quantitatively measure the hidden relations between a storyline and its hidden factors.

[Figure 6.1 diagram: storylines "Earthquake" and "Terrorism", plus a "Background" node, connect to shared event types (Casualties, Rescue, Investigation, Attack, Protest), which in turn connect to Topic 1 through Topic 4 and other topics.]

Figure 6.1: An example of the storyline-event-topic hierarchical structure of ASG.

To further improve performance, the ASG model uses Twitter hashtags created by users as labels to "supervise" storyline generation in long news reports. Nowadays, "share to Twitter/Facebook" options are embedded in every news article posted on the websites of major news media, such as CNN and BBC. When Twitter users share these articles from the original website, or retweet related posts from their friends, they create special terms starting with #, the so-called "hashtags", to denote the topics/trends of their posts. Although Twitter data is so noisy that most existing storyline generation tools are unable to cope with it adequately [65, 101], these user-created hashtags effectively provide human annotations for long articles. The major contributions of this paper are summarized as follows:

• A novel Bayesian model is proposed to capture the features of real world events. ASG model represents storyline as a three-layer structure, and provides solutions to measure hid- den relations among storylines, events, and topics.

• Human input is incorporated into the storyline generation process. The rich, up-to-date Twitter data provide the "cheapest" human-made labels (hashtags), since they are publicly accessible. ASG improves its efficiency by using these user-created Twitter hashtags to filter redundant event types.

• An efficient Gibbs sampling inference procedure is provided for the proposed ASG model. Gibbs sampling was chosen for the inference and parameter estimation of the ASG model for its high estimation accuracy in LDA-like graphical models.

• The effectiveness of the proposed ASG model is demonstrated through comparison with existing state-of-the-art algorithms. The ASG model is tested on large datasets associated with real-world events. With extensive quantitative and qualitative results, the ASG model shows significant improvements over the baseline methods.

6.2 Related Work

To the best of our knowledge, this is the first attempt to generate storylines for long articles utilizing knowledge from social media, but there are several lines of related research, such as topic tracking, news and Twitter modeling, and storyline discovery. Topic tracking methods aim to identify hidden topics and track topical changes across time. Most of the earlier work in this area, for example DTM [9], estimated the current topic distribution through parameters learned from the previous epoch. In addition to methods based on Markov assumptions, some work has modeled the evolution of topics using time stamps generated from a continuous distribution [116]. The TAM model [47] is a hybrid of these two approaches, capturing changes via a "trend class", a latent variable with distributions over topics, words, and time. However, the granularity of an "epoch" or "trend" in topic tracking approaches is inherently too fine to be suitable for the storyline discovery task. Twitter is a newly emerged platform for news spreading [53], covering almost all domains of newswire events [70]. Approaches that combine news and social media data in one joint model have been proposed to improve Twitter topic modeling performance by "transferring" knowledge learned from long articles, such as those in Wikipedia, blogs, and news reports, to short tweets [38, 41,

43]. It is generally agreed that Twitter data is inherently noisy [122]; therefore, few previous studies have sought to use knowledge provided by social media users in the reverse direction, to label or organize long articles. Unlike these methods, our proposed ASG model uses only the "hashtags" from tweets and omits the rest of the noisy content. Storyline discovery is the research branch closest to our work. Shahaf et al. [100] proposed a metro-map style story generation framework, which first detects community clusters in each time window and then groups these communities into stories. Yan et al. tracked evolution trajectories along the timeline by emphasizing the relevance, coverage, coherence, and diversity of themes [121]. Mei et al. [73] proposed an HMM-style probabilistic method to discover and summarize the evolutionary patterns of themes in text streams. Lappas et al. [56] designed a term-burstiness model to discover the temporal trends of terms in news article streams. Taking user queries as input, [65] first extracted relevant tweets and then generated storylines through graph optimization. Lin et al. [63] built an HDP (Hierarchical Dirichlet Process) model for each time epoch and then selected sentences for the storyline by considering multiple aspects such as topic relevance and coherence. Huang et al. identified local/global aspects of documents and organized these components into a storyline via optimization [42], while Zhou et al. modeled storylines as distributions over topics and named entities [127]. None of the above works jointly considers social media and news data, and none provides a complete storyline-event-topic structure such as the one proposed in this paper.

6.3 Model

The graphical model and generative process of ASG are shown in Figure 6.2 and Algorithm 5, respectively. Each document dm is a news article embedded in a tweet URL, associated with the tweet's hashtags Λm. The storyline s is a multinomial variable indicating which storyline document dm belongs to, generated from the multinomial distribution πs. Each storyline has one multinomial distribution ψs over events. The variable e denotes a document's event label, drawn from the E-dimensional distribution ψs. Each event e has a multinomial distribution φe over the K topics. As we will discuss in detail later, the matrices ψs and φe reflect the relations among storyline s, event e, and topic z.

Each word w in document dm is associated with two labels: a switching variable x and a topic indicator z. x = 0 means word w is generated from the background distribution Φb, x = 1 means the word is generated from the storyline distribution Φs, and x = 2 means the word is generated from the topic distribution Φw. In the case of x = 2, the topic z is sampled from the K-dimensional multinomial distribution φe. Under this strategy, the ASG model can explain words in three different ways: from topics, from storylines, and from a background word distribution. The whole dataset has only one background word distribution Φb, while there are S different storyline-word distributions Φs and K different topic distributions Φw. This matches the intuition that a document is a mixture of background words (e.g., stop words), storyline words, and a set of aspect topics (e.g., locations). Twitter users often create hashtags to emphasize the key points of their posts. For example, a tweet

Figure 6.2: Graphical model for ASG.

ALGORITHM 5: Generation Process of ASG model
Draw πx ∼ Dir(γx);
Draw πs ∼ Dir(γs);
Draw Φb ∼ Dir(βb);
for each storyline s = 1, 2, ..., S do
    Draw Φs(s) ∼ Dir(βs);
for each event e = 1, 2, ..., E do
    Draw φ(e) ∼ Dir(ω);
for each topic z = 1, 2, ..., K do
    Draw Φw(z) ∼ Dir(βw);
for each document m = 1, 2, ..., M do
    for each event e = 1, 2, ..., E do
        Draw Λm(e) ∼ Bernoulli(·|ϕe);
    Draw s ∼ Multi(πs);
    Generate εm = diag(Λm) × ε;
    Draw ψs(s) ∼ Dir(εm);
    Draw e ∼ Multi(ψs(s));
    for each word w in document m do
        Draw x ∼ Multi(πx);
        Draw z ∼ Multi(φ(e));
        if x = 0 then draw w ∼ Φb;
        if x = 1 then draw w ∼ Φs(s);
        if x = 2 then draw w ∼ Φw(z);

about the presidential election may contain hashtags such as #Hillary or #Trump. As a kind of human-made label, hashtags can be used to simplify search, indexing, and topic discovery [92]. To enable modeling of the context associated with hashtags, we restrict the storyline-event distribution ψs to be filtered by the document's hashtags Λm. Toward this goal, we set E to the number of hashtags contained in the whole dataset. Each element Λm(e) of the hashtag labels Λm is generated from a Bernoulli distribution with prior probability ϕe. As shown in Lines 13 and 14 of Algorithm 5, the storyline-event vector ψs is then drawn from a Dirichlet distribution with parameters diag(Λm) × ε. Suppose there are 5 important hashtags in the dataset; then E is 5 and Λm is a 5-dimensional vector in which each element is either 0 or 1. For a document associated with 2 hashtags, such as the presidential election tweet mentioned above, ψs is sampled with prior εm = diag(Λm) × ε = (0, 0, ε3, ε4, 0)^T.
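The document-level steps of Algorithm 5 can be sketched with NumPy's Dirichlet and categorical samplers. This is an illustrative sketch of the generative process only; the dimensions, hyperparameter values, and variable names are assumptions, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
S, E, K = 6, 21, 50   # storylines, event types (= #hashtags), topics

pi_s = rng.dirichlet(np.full(S, 0.1))           # storyline proportions pi_s
phi_e = rng.dirichlet(np.full(K, 0.1), size=E)  # event-topic matrix phi^(e)

def generate_document(n_words, hashtag_mask, eps=0.1):
    """Sample one document's storyline, event, and word topics.

    hashtag_mask: binary vector Lambda_m that switches off events whose
    hashtags are absent from the document (diag(Lambda_m) x eps in the
    paper's notation).
    """
    s = rng.choice(S, p=pi_s)
    eps_m = hashtag_mask * eps
    # A Dirichlet needs strictly positive parameters, so sample only
    # over the events the hashtags allow.
    allowed = np.flatnonzero(eps_m)
    psi = rng.dirichlet(eps_m[allowed])
    e = allowed[rng.choice(len(allowed), p=psi)]
    topics = rng.choice(K, size=n_words, p=phi_e[e])
    return s, e, topics

mask = np.zeros(E); mask[[3, 7]] = 1.0  # a document carrying 2 of the E hashtags
s, e, topics = generate_document(20, mask)
```

Restricting the Dirichlet to the hashtag-allowed events is how the mask εm = diag(Λm) × ε zeroes out the probability of events whose hashtags the document does not carry.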

6.4 Model Inference and Learning

In this section, we first describe the inference process of the collapsed Gibbs sampler for the ASG model, and then discuss the training and testing operations for the proposed model.

6.4.1 Model Inference

When the hashtag labels Λm of document m are observed, the prior parameter ϕ is d-separated from the rest of the ASG model. The key to the inference is to estimate the posterior distributions of the hidden variables: (1) the multinomial switch variable xmn for word wmn; (2) the topic assignment variable zmn for word wmn when the corresponding switch variable xmn equals 2; (3) the event assignment variable em for document m; (4) the storyline assignment variable sm for document m.

Gibbs sampling is chosen for the inference. First, the posterior of sm is calculated through Equation (6.1). Due to space limitations, we only present the result here.

$$
p(s_m=s\mid\mathbf{w},\mathbf{z},\mathbf{s}_{\neg m},\mathbf{e},\mathbf{x})
= \prod_{v=1}^{V}\left[\frac{n_{s}^{v}+\beta_s}{\sum_{v=1}^{V}\left(n_{s}^{v}+\beta_s\right)}\right]^{n_{ms}^{v}}
\cdot \frac{n_{s}^{e}+\omega}{\sum_{i=1}^{E}\left(n_{s}^{i}+\omega\right)}
\cdot \frac{n^{s}+\gamma_s}{\sum_{i=1}^{S}\left(n^{i}+\gamma_s\right)}
\tag{6.1}
$$

In the above equation, $V$ is the size of the vocabulary, $E$ is the number of events, and $S$ is the number of storylines. $n_{s}^{v}$ is the number of times term $v$ chooses storyline $s$ (when its $x=1$), $n_{ms}^{v}$ is the number of times term $v$ chooses storyline $s$ within document $m$, $n_{s}^{e}$ is the number of documents choosing event $e$ within storyline $s$, and $n^{i}$ is the number of documents choosing storyline $i$.

The inference of em is slightly different from that of sm. Each storyline s has a distribution ψs over events. Given storyline s, document chooses corresponding event e from ψs: 126

$$
p(e_m=e\mid\mathbf{w},\mathbf{z},\mathbf{s},\mathbf{e}_{\neg m},\mathbf{x})
= \prod_{z=1}^{K}\left[\frac{n_{e}^{z}+\omega}{\sum_{z=1}^{K}\left(n_{e}^{z}+\omega\right)}\right]^{n_{em}^{z}}
\cdot \frac{n_{s}^{e}+\varepsilon}{\sum_{i=1}^{E}\left(n_{s}^{i}+\varepsilon\right)},
\tag{6.2}
$$

where $K$ is the number of topics, $n_{e}^{z}$ is the number of words choosing topic $z$ under event $e$, and $n_{em}^{z}$ is the number of words in document $m$ choosing topic $z$.

At the word level, word wmn first decides its value of x: (1) when xmn = 0, word wmn is sampled from the background word distribution Φb; (2) when xmn = 1, word wmn is chosen from the storyline word distribution Φs, where s is the choice of document m; (3) when xmn = 2, word wmn is drawn from the topic distribution Φw(z), where z is chosen beforehand by the word:

$$
p(x_{mn}=x\mid\mathbf{w},\mathbf{z},\mathbf{s},\mathbf{e},\mathbf{x}_{\neg i}) \propto
\begin{cases}
\dfrac{n_{b}^{v}+\beta_b}{\sum_{v=1}^{V}\left(n_{b}^{v}+\beta_b\right)}\cdot\dfrac{n_{xm}^{0}+\gamma_x}{\sum_{i=0}^{2}\left(n_{xm}^{i}+\gamma_x\right)}, & x=0,\\[2ex]
\dfrac{n_{s}^{v}+\beta_s}{\sum_{v=1}^{V}\left(n_{s}^{v}+\beta_s\right)}\cdot\dfrac{n_{xm}^{1}+\gamma_x}{\sum_{i=0}^{2}\left(n_{xm}^{i}+\gamma_x\right)}, & x=1,\\[2ex]
\dfrac{n_{z}^{v}+\beta_w}{\sum_{v=1}^{V}\left(n_{z}^{v}+\beta_w\right)}\cdot\dfrac{n_{xm}^{2}+\gamma_x}{\sum_{i=0}^{2}\left(n_{xm}^{i}+\gamma_x\right)}, & x=2.
\end{cases}
\tag{6.3}
$$

When xmn = 2, the topic assignment zmn needs to be decided first. Similar to the storyline-event relationship, each event has a distribution over topics. Given event e, topics are chosen from the multinomial distribution φe:

$$
p(z_i=k\mid\mathbf{w},\mathbf{z}_{\neg i},\mathbf{s},\mathbf{e},\mathbf{x})
= \frac{n_{k}^{v}+\beta_w}{\sum_{v=1}^{V}\left(n_{k}^{v}+\beta_w\right)}
\cdot \frac{n_{e}^{k}+\omega}{\sum_{z=1}^{K}\left(n_{e}^{z}+\omega\right)},
\tag{6.4}
$$

where $n_{z}^{v}$ is the number of times term $v$ chooses topic $z$ in the scope of the whole corpus.
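The three-way switch in Equation (6.3) amounts to sampling x from an unnormalized three-element distribution built from the background, storyline, and topic counts. A minimal sketch, with illustrative count-array names that are not from the original implementation:

```python
import numpy as np

def sample_switch(v, n_b, n_s, n_z, n_xm, beta_b, beta_s, beta_w, gamma_x, rng):
    """Sample the switch variable x for term v, following Eq. (6.3).

    n_b, n_s, n_z: count vectors over the vocabulary for the background
    distribution, the document's storyline, and the word's topic.
    n_xm: per-document counts of each switch value (length 3).
    """
    V = len(n_b)
    # First factor of each case: probability of term v under the
    # background, storyline, or topic word distribution.
    word_probs = np.array([
        (n_b[v] + beta_b) / (n_b.sum() + V * beta_b),
        (n_s[v] + beta_s) / (n_s.sum() + V * beta_s),
        (n_z[v] + beta_w) / (n_z.sum() + V * beta_w),
    ])
    # Second factor: the document's preference for each switch value.
    switch_probs = (n_xm + gamma_x) / (n_xm.sum() + 3 * gamma_x)
    p = word_probs * switch_probs
    return rng.choice(3, p=p / p.sum())
```

In the full sampler this draw is interleaved with the topic draw of Equation (6.4) whenever x = 2.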

6.4.2 Learning Operations

ASG can be treated as a semi-supervised model because the hidden variables are learned under the supervision of human-made hashtags. In the training process, the ASG model is fed with pre-given storyline labels. That is, the storyline labels of documents can be seen by the model, while the event labels, topic assignments, and switch variables are inferred to maximize the likelihood of the observed words and storyline labels. With the trained model M, the ASG model can estimate the posterior distributions of the switch variables x̃, topics z̃, events ẽ, and storyline labels s̃ for newly arriving documents, without any pre-given labels. To achieve this goal, we follow the approach introduced in [106] and run the inference process on the new documents exclusively. Inference for

this testing process corresponds to Equations (6.1)-(6.4), with the difference that the Gibbs sampler is run with the estimated parameters Φb, Φs, Φw, φ, ψ and fixed hyperparameters. Take the inference of the switch variable x as an example: in the initial stage, the algorithm randomly assigns switch variables to words; then a number of Gibbs sampling updates are conducted to estimate the posterior. Similar to the switch variable x̃, the estimates of the storyline label s̃, event label ẽ, and topic assignment z̃ can be calculated according to Equations (6.1), (6.2), and (6.4).

6.5 Experiment

In this section, we first describe our evaluation datasets, and then compare our proposed ASG model with existing state-of-the-art algorithms. Finally, extensive discovery results are presented by exploring the outputs of the ASG model.

6.5.1 Datasets and Experiment Settings

To evaluate our proposed model and other storyline generation methods, we conduct our experiments on datasets containing 110,347 tweets and 27,308 news articles on 6 event subjects. These events were chosen for their great social influence and high evolution complexity. Specifically, the datasets were collected via the following steps: 1) For each event, filter the Twitter data through the Twitter REST API using event-relevant keywords and hashtags provided by domain experts. 2) Extract the URL links embedded in the tweets and download the documents associated with these links. 3) Conduct stemming and lemmatization. Note that we did not remove stop words in the preprocessing step, since our proposed ASG model can treat these words as background words. Table 6.1 lists statistical information about the evaluation datasets. We asked 12 human annotators to create labels for these documents: 1) select a storyline label from the six event subjects for each document, and 2) assign an event label from the 21 event types provided by domain experts for each document. Each document is associated with one storyline label and one event label. We divided the whole dataset evenly into 4 parts and assigned them to 4 groups of annotators. Within each group, a label was included in the ground truth only if it was chosen by at least 2 of the 3 annotators.

In our evaluation, we use weak symmetric priors for all Dirichlet parameters: γs = 0.1, γx = 0.3, βs = 0.0001, βw = βb = 0.001, ε = ω = 0.01. The number of topics K is 50, the number of events is determined by the total number of hashtags contained in the datasets, and the number of storylines is set to 6. The Gibbs sampler is run for 500 iterations, with the first 100 iterations as the burn-in period. We compare our ASG model with the following methods. 1) Random: The random method selects documents randomly for storylines and events. 2) K-means: This method identifies storylines and events by K-means clustering. The number of clusters is set to 6 (the number of ground-truth storylines). 3) LSA: LSA analyzes the relationships between a set of documents and

Dataset             #Tweets  #News  Country
World Cup           11K      3352   Brazil
President Election  9K       2491   Colombia
Security Protests   12K      3787   Venezuela
Education March     14K      4841   Chile
Iguala Kidnap       25K      5853   Mexico
Paris Attack        37K      6984   France

Table 6.1: Detailed information on the datasets

terms they contain; it uses SVD to reduce the dimensionality of features with high similarity. In our experiment, the number of clusters is set to 6. 4) LDA: This method applies standard LDA twice to discover storylines and events. First, the topics found in the first run are treated as storylines (K1 = 6); then, within each obtained topic, another LDA is run to distinguish the events (K2 = 50). 5) ASGH: This approach is a variant of ASG that does not use Twitter hashtags for the generation of events. 6) ASGB: This method is another variant of ASG that excludes the background distribution Φb and has a symmetric beta prior γx = 0.5.

6.5.2 Experiment Results

Table 6.2 reports the ACC and NMI results for the seven methods. ACC measures performance at the storyline level, while NMI better evaluates the results at the event level. Random selection is the worst performer: its ratio of correct guesses on storylines (ACC) is around 1/6, and its probability of a successful guess at the event level (NMI) is almost 0. Our ASG model outperforms the baseline methods in most of the fields, except on “World Cup”, where ASGB achieves the best outcome. Two interesting observations can be made from Table 6.2. First, Bayesian models perform better than clustering methods. The two clustering methods, K-means and LSA, are much better than random selection, but significantly poorer than the Bayesian models LDA, ASGH, ASGB, and ASG. This is because the clustering methods are built simply on word similarities, without further knowledge of the relationships between words (hidden topics). Second, the “hashtag” factor is more important than the “background words” factor. Compared to the ASGH model, the ASGB model obtains performance closer to the ASG model, which indicates that utilizing hashtags indeed improves performance significantly. As later shown in Table 6.3, the background distribution scheme can remove stop words and commonly used words, which benefits the overall performance of the ASG model. Table 6.3 lists the top 15 background, storyline, and topic words learned by the ASG model. Overall, the ratios of words assigned to the background distribution, storyline distributions, and topic distributions are 33%, 27%, and 38%, respectively. We discuss the three types of words as follows. 1) Background words. There is only one background word distribution over the whole corpus. Most of the top ranked background words are

Table 6.2: Performance comparison among storyline detection methods (ACC, NMI)

Method    World Cup            President Election   Security Protests    Education March
          Acc      NMI         Acc      NMI         Acc      NMI         Acc      NMI
Random    0.14411  2.84E-05    0.153098 0.000515    0.156086 0.00158     0.17796  0.000962
K-means   0.342921 0.02507     0.302279 0.024265    0.354775 0.16668     0.323172 0.117916
LSA       0.426756 0.000448    0.40866  0.009094    0.36727  0.117569    0.378563 0.00213
LDA       0.498904 0.280061    0.457191 0.141977    0.461266 0.271075    0.482378 0.243714
ASGH      0.512481 0.319418    0.484782 0.152756    0.517365 0.313238    0.495184 0.248761
ASGB      0.623336 0.399913    0.585973 0.181412    0.597942 0.394318    0.505529 0.268152
ASG       0.598904 0.380061    0.622446 0.210056    0.617287 0.528883    0.290283 0.415664

stop words such as “en” (Spanish), “the”, “de” (Spanish), or commonly used words such as “video”, “http”, and “com”. 2) Storyline words. Storyline words are not used as commonly as background words, but still appear across a broad range of documents within the storyline. Storyline 1 appears to be related to the “Paris attack”, since it contains the word “paris” and other event words such as “theater”, “hurt”, and “shoot”. Similarly, Storyline 2 is about the kidnapping in Mexico, consisting of words such as “mexico”, “secuestrado” (Spanish for “kidnapped”), and “estudiante” (Spanish for “student”). 3) Topic words. Compared to storyline words and background words, topic words are narrowed to more specific contexts. For example, Topic 16 is a set of words describing government, with words such as “office”, “police”, and “gobierno” (Spanish for “government”), while Topic 21 is about people’s names, including “dilma” (the Brazilian president), “santos” (the Colombian president), and “zuluaga” (a Colombian economist). One of the major contributions of the ASG model is that ASG is the first storyline generation model that can quantitatively measure the relations among storylines, events, and topics. In practice, this knowledge is learned from the storyline-event matrix ψ and the event-topic matrix φ. Figure 6.3 illustrates the usage of these matrices. Corresponding to Figure 6.1 in the introduction section, the triangles in the central part denote storylines, the circles around the triangles represent event types, and the outermost squares stand for topics. The thickness of the edge between a triangle (storyline s) and a circle (event e) is proportional to the corresponding value of ψse, and similarly, the thickness of the edge between a circle (event e) and a square (topic z) corresponds to the value of φez. By correlating Table 6.3 with Figure 6.3, some interesting patterns can be observed. 1) Mapping ground-truth labels. By referring to the storyline word distributions in Table 6.3, we mapped the storylines to real cases: the blue triangle should be the “Kidnap” storyline, with storyline words such as “mexico” and “students”; the red triangle is the “Paris attack”, because words such as “terrorist” and “paris” are top ranked in its storyline-word list. Also, as mentioned above, topic 16 is related to “government” and topic 21 to “people’s names”. 2) Relations between topics and event types. By referring to the values of matrix φez, the resulting event types can be inferred through their connected topics. For example, in Figure 6.3, event 2

Table 6.3: Example of background/storyline/topic words learned by ASG model.

Background      Storyline 1        Storyline 2          Topic 16         Topic 21
la      0.0093  paris     0.0104   presidente   0.0058  office   0.0076  pineda   0.0030
el      0.0087  help      0.0057   estudiante   0.0055  view     0.0074  esporte  0.0026
en      0.0058  theater   0.0049   EPN          0.0054  gobierno 0.0072  abarca   0.0025
the     0.0057  http      0.0043   news         0.0053  police   0.0061  right    0.0023
video   0.0053  shoot     0.0039   mar          0.0051  time     0.0037  zuluaga  0.0023
com     0.0051  news      0.0034   mexico       0.0049  comment  0.0034  dilma    0.0022
de      0.0038  hurt      0.0032   home         0.0047  city     0.0032  trial    0.0019
www     0.0038  hall      0.0030   iguala       0.0041  voice    0.0029  hecha    0.0018
marzo   0.0037  kill      0.0029   youtube      0.0037  new      0.0027  santos   0.0017
Twitter 0.0036  world     0.0029   gobierno     0.0035  report   0.0027  vaga     0.0016
filter  0.0034  police    0.0029   toda         0.0035  tudo     0.0027  jamariro 0.0016
apply   0.0034  terrorist 0.0026   nacional     0.0035  medium   0.0026  soho     0.0014
email   0.0033  people    0.0026   poltica      0.0035  share    0.0025  bio      0.0012
http    0.0031  maduro    0.0026   kidnap       0.0030  arrest   0.0024  bundchen 0.0011
da      0.0031  gun       0.0025   secuestrado  0.0030  agosto   0.0023  lula     0.0011

is connected to topic 16 (government) and topic 21 (people’s names). Among the 21 event types, the combination of these two topics is closest to the event type “investigation”. Similarly, event 1 is recognized as “attack”, event 6 means “protest”, and event 8 denotes “terrorism”. 3) Relations between storylines and event types. The impact of event-type factors on storylines can be analyzed through the values of matrix ψse. As can be seen from Figure 6.3, both the “Paris attack” and “Kidnap” storylines are connected to the events “investigation” and “attack”; the difference is that the “Paris attack” storyline has a stronger connection to the event “attack” than to “investigation”, while the “Kidnap” storyline has balanced weights on the two events. Besides the commonly shared events, “Paris attack” has a private event type, “terrorism”, and “Kidnap” has one exclusive event type, “protest”.
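Since edge thickness in Figure 6.3 is proportional to ψse (and, for event-topic edges, to φez), the dominant connections can be read directly off the learned matrices. A sketch with made-up values: the storyline and event names follow the figure, but the numbers are invented for illustration.

```python
import numpy as np

# psi[s, e]: storyline-event weights. Names follow Figure 6.3; values are illustrative.
psi = np.array([[0.6, 0.3, 0.1, 0.0],   # "Paris attack": strong on "attack"
                [0.4, 0.4, 0.0, 0.2]])  # "Kidnap": balanced on "attack"/"investigation"
storylines = ["Paris attack", "Kidnap"]
events = ["attack", "investigation", "terrorism", "protest"]

def strongest_edges(mat, row_names, col_names, top=2):
    """Per row, return the `top` heaviest edges - the ones drawn thickest
    in a diagram like Figure 6.3. The same reading applies to phi (event-topic)."""
    edges = []
    for i, row in enumerate(mat):
        for j in np.argsort(row)[::-1][:top]:
            edges.append((row_names[i], col_names[j], float(row[j])))
    return edges

print(strongest_edges(psi, storylines, events))
```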
These observations match real-world facts well, which directly implies that the ASG model is a useful tool to identify and interpret the hidden factors driving the evolution of events.
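For reference, the ACC and NMI scores reported in Table 6.2 can be computed from predicted and ground-truth assignments roughly as follows. This assumes scikit-learn and SciPy; mapping predicted clusters to labels via the Hungarian algorithm is the usual convention for clustering accuracy, which we assume here rather than know to be the exact protocol used.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: map predicted cluster ids to ground-truth labels with the
    Hungarian algorithm, then score the fraction of matched documents."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = int(max(y_true.max(), y_pred.max())) + 1
    overlap = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        overlap[p, t] += 1
    rows, cols = linear_sum_assignment(-overlap)  # negate to maximize overlap
    return overlap[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # same partition, permuted cluster ids
print(clustering_accuracy(y_true, y_pred))           # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```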

6.6 Conclusion

In this paper, we have proposed a hierarchical Bayesian model named ASG to detect storylines. The novelty of the ASG model lies in three aspects. First, ASG is the only model with a three-layer structure for the task of storyline generation. Most previous researchers considered Twitter data too noisy for topic models to achieve good performance. On the contrary, our ASG model sets aside the noisy context of tweets and instead utilizes the promising labels made by Twitter users (hashtags) to improve performance. Besides, through specially designed data structures, ASG is capable of measuring the hidden relations among storylines, events, and topics. We present the results of


Figure 6.3: Relations among storylines, event types, and topics. The triangles are symbols for storylines, the circles denote event types, and the squares indicate topics.

applying the ASG model to real-world events and show its effectiveness over non-trivial baselines. Based on the outputs of the ASG model, further analysis can be carried out to understand the underlying factors within the documents, which can lead to a broad spectrum of future research.

Chapter 7

Completed Work and Future Work

The proposed research aims to discover underlying topics, events, and stories from social media data. Five major approaches are included in this dissertation: targeted-domain Twitter event detection, topical interaction between social media and other data sources, discriminative learning among multiple datasets, connecting massive documents into storylines, and improving disease simulation with knowledge learned from social media.

For the problem of targeted-domain Twitter event detection, this dissertation presents a novel semi-supervised approach for detecting spatial events of user-targeted interest. A label generation method is developed that utilizes knowledge learned from news to automatically generate tweet labels for a targeted domain. To extract spatial events from these domain-related tweets, a multinomial spatial-scan method is proposed, which is capable of simultaneously capturing all the geographical information of tweets. Extensive experimental studies are conducted on Twitter data collected in 10 Latin American countries. The results demonstrate the effectiveness and efficiency of our proposed approach.

For the problem of topical interaction between social media and other data sources, this dissertation develops a new generative model to jointly consider the topics in news documents and Twitter streams. An efficient Gibbs sampling inference algorithm is derived to optimize the model parameters. Utilizing the learned posteriors of the model, we are able to discover various factors, such as milestone documents, topic differences, and topic influence. Extensive empirical studies demonstrate the effectiveness of the new approach by comparing it with representative methods.

For the problem of discriminative learning, this work presents a novel Bayesian learning framework to identify common and distinctive topics across multiple datasets.
Specifically, this framework has a hierarchical structure: multiple datasets share one global topic-term distribution, and each individual dataset has its own topics. Besides the topic index, each word is also associated with a choice variable that decides whether the word is generated from the shared common topics or from a dataset-specific topic. The empirical results demonstrate that the proposed method beats the baselines in various application scenarios.
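The generative choice just described can be sketched as a toy generative process. The Bernoulli choice between shared and dataset-specific topics follows the text; the vocabulary size, topic counts, and p_common value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 6  # toy vocabulary size (illustrative)
common = rng.dirichlet(np.ones(V), size=2)                           # shared global topics
specific = {d: rng.dirichlet(np.ones(V), size=2) for d in range(3)}  # per-dataset topics

def generate_word(dataset, p_common=0.6):
    """Draw the choice variable first: a shared common topic vs. the
    dataset's own specific topic, then sample a word from that topic."""
    if rng.random() < p_common:
        topic = common[rng.integers(len(common))]
    else:
        topic = specific[dataset][rng.integers(2)]
    return int(rng.choice(V, p=topic))

print([generate_word(dataset=1) for _ in range(8)])
```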


For the problem of epidemic modeling in flu forecasting based on social media data, we propose a novel bi-space framework to combine the strengths of computational epidemiology and social media mining. In the social media space, an LDA-like graphical model learns the health state of each user. The aggregated information is used as input parameters for the simulation space, and the optimized results from the simulation can in turn improve the prior knowledge in the social media space. Extensive experiments across various states and flu seasons show the advantages of integrating computational epidemiology and social media mining.

For the problem of storyline generation with help from social media data, this work presents a novel Bayesian learning framework to infer the inner relations between seemingly unrelated documents. Specifically, this framework assumes a three-layer topic-event-storyline structure for documents, and models relations as different combinations of underlying topics and events, which can be learned from the posteriors of the hidden variables. The empirical results demonstrate that it can effectively detect storyline clusters and outperform competing methods by a substantial margin on both accuracy and NMI.

7.1 Research Tasks

The major research tasks are described as follows. The current status of these tasks is listed in Table 7.1.

7.1.1 Targeted-domain Twitter Event Detection

• Development of automatic label generation via news data (A1). We propose a semi-supervised approach for targeted-domain event detection in Twitter. Unlike traditional methods, our method requires no human input to build the training label set.

• Design of a novel text classifier for Twitter data (A2). Besides plain text, we consider multiple Twitter features, such as hashtags, mentions, and replies, to cluster related tweets into groups and improve the classification performance.

• An innovative multinomial spatial-scan location estimation algorithm (A3). Based on the event-related tweets identified by the classifier, this method jointly considers all the information from tweets, such as GPS coordinates, profile locations, and location mentions, to improve the estimation of event locations.
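As a simplified illustration of combining the three location signals from tweets, consider the weighted aggregation below. The signal weights and scoring function are our own toy stand-in, not the actual multinomial spatial-scan statistic developed in A3.

```python
from collections import Counter

# Illustrative weights for the three location signals a tweet can carry;
# the real multinomial spatial-scan statistic is more involved.
SIGNAL_WEIGHTS = {"gps": 1.0, "profile": 0.5, "mention": 0.25}

def region_scores(tweets):
    """Aggregate weighted location evidence per region from event-related tweets."""
    scores = Counter()
    for t in tweets:
        for signal, region in t.items():
            scores[region] += SIGNAL_WEIGHTS[signal]
    return scores

tweets = [
    {"gps": "Iguala", "profile": "Mexico City"},
    {"profile": "Iguala", "mention": "Iguala"},
    {"mention": "Acapulco"},
]
print(region_scores(tweets).most_common(1))  # [('Iguala', 1.75)]
```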

7.1.2 News & Social Media Influence and Interaction

• A novel generative framework for jointly modeling multiple datasets (B1). We propose an asymmetric framework to transfer topical knowledge learned from news to Twitter, and meanwhile improve topic modeling performance on Twitter.

• An effective sampling algorithm for model parameter inference (B2). The Gibbs sampling algorithm is used for parameter estimation in the proposed model.

• Evaluation of the proposed model on real-world datasets (B3). We use the outputs of the proposed method to explore the topical interactions between news and Twitter data.

7.1.3 Storyline Generation via help from Social Media

• Proposal of Bayesian networks for Storyline Generation (C1). In the proposed model, we build a three-layer generative model for learning topics, events, and storyline labels for documents. The relations between different layers are modeled through two probability matrices.

• Developing a Gibbs sampling inference method to estimate parameters (C2). Gibbs sampling was chosen for inference and parameter estimation because of its high accuracy for LDA-like graphical models.

• Evaluate the proposed model on real world datasets, and compare it with baselines (C3).
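The Gibbs sampling inference in C2 follows the standard collapsed sampler for LDA-like models. Below is a minimal sketch for plain LDA; the hyperparameters and toy corpus are illustrative, and the full ASG sampler adds event and storyline variables on top of this basic loop.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of word-id lists; V: vocabulary size; K: number of topics."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # doc-topic counts
    nkw = np.zeros((K, V))           # topic-word counts
    nk = np.zeros(K)                 # topic totals
    z = [[0] * len(d) for d in docs]
    for d, doc in enumerate(docs):   # random initialization
        for i, w in enumerate(doc):
            k = int(rng.integers(K))
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]          # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z = k | rest), up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Two toy "themes": words {0, 1} vs. words {2, 3}
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 2], [0, 0, 1, 1], [3, 3, 2, 2]]
theta, phi = collapsed_gibbs_lda(docs, V=4, K=2)
print(theta.round(2))
```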

7.1.4 Epidemic Simulation with updates from Social Media

• Proposing a novel integrated framework for computational epidemiology and social media mining (D1): The existing approaches in computational epidemiology and social media mining focus on different but complementary aspects. The former mainly focuses on modeling the underlying mechanisms of flu spreading, while the latter can provide timely and detailed disease data. Our proposed framework integrates the strengths of both approaches.

• Developing an efficient parameter inference algorithm (D2): To achieve such integration, we provide parameter estimations for both spaces.

• Evaluation on datasets and baselines (D3): Add new datasets to evaluate the current methods and include more baselines for comparison. In addition, consider types of disease epidemics other than influenza.

7.1.5 Learning Common and Distinctive Topics from Multiple Datasets

• Proposing a novel Bayesian model to simultaneously identify common and distinct topics among different datasets (E1): Previous related work either lacks a specially designed discriminative learning scheme or cannot handle multiple datasets. The proposed CDTM model utilizes a hierarchical Bayesian structure that considers common and distinctive topics within one framework.

Table 7.1: Research tasks and status

Task             Description                                                             Status
Research Area A  Targeted-domain Twitter Event Detection                                 Completed
A1               Development of automatic label generation via news data                 Completed
A2               Design of a novel text classifier for Twitter data                      Completed
A3               An innovative multinomial spatial-scan location estimation algorithm    Completed
Research Area B  News & Social Media Interaction                                         Completed
B1               A novel generative framework for jointly modeling multiple datasets     Completed
B2               An effective sampling algorithm for model parameter inference           Completed
B3               Evaluation of the proposed model on the real world dataset              Completed
Research Area C  Storyline Generation via help from Social Media                         Completed
C1               Proposal of Bayesian networks for Storyline Generation                  Completed
C2               Developing a Gibbs sampling inference algorithm to estimate parameters  Completed
C3               Evaluate the proposed model and baselines on real world datasets        Completed
Research Area D  Epidemic Simulation with updates from Social Media                      Completed
D1               Proposal of social media embedded epidemics model                       Completed
D2               Developing an efficient parameter inference algorithm                   Completed
D3               Evaluation on datasets and baselines                                    Completed
Research Area E  Discriminative learning to identify common and distinctive topics       Completed
E1               Proposal of CDTM model for discriminative learning                      Completed
E2               Inference algorithm via Gibbs sampling                                  Completed
E3               Evaluate on various datasets and compare performance for baselines      Completed

• Providing an efficient Gibbs sampling inference for the CDTM model (E2): Gibbs sampling is utilized to estimate the parameters of the CDTM model due to its high accuracy when performing estimations for LDA-like graphical models.

• Conducting extensive experiments to compare the proposed CDTM model with the most important state-of-the-art algorithms on real-world datasets (E3).

7.2 Schedule

The proposed work includes 16 specific tasks, including thesis revision. Among these tasks, A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, and D2 were accomplished before the preliminary exam. Tasks E1, E2, and E3 were finished by the research defense. Tasks D2 and D3 were finished by December 2017. The schedule of the proposed research is illustrated in Table 7.1.

7.3 Publications and submissions

7.3.1 Current Publications

1. Ting Hua, Xuchao Zhang, Wei Wang, Chang-Tien Lu, and Naren Ramakrishnan. "Automatical Storyline Generation with Help from Twitter." In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 2383-2388. ACM, 2016.

2. Ting Hua, Feng Chen, Liang Zhao, Chang-Tien Lu, and Naren Ramakrishnan. "Automatic targeted-domain spatiotemporal event detection in twitter." GeoInformatica 20, no. 4 (2016): 765-795.

3. Liang Zhao∗, Ting Hua∗, Chang-Tien Lu, and Ing-Ray Chen. "A topic-focused trust model for Twitter." Computer Communications 76 (2016): 1-11. (∗) The two authors contribute equally.

4. Ting Hua, Ning Yue, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. "Topical anal- ysis of interactions between news and social media." In Proceedings of the 30th AAAI Con- ference on Artificial Intelligence, pp. 2964-2971. 2016.

5. Ting Hua, Liang Zhao, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. “How events unfold: spatiotemporal mining in social media.” SIGSPATIAL Special 7, no. 3 (2016): 19-25.

6. Liang Zhao, Feng Chen, Jing Dai, Ting Hua, Chang-Tien Lu, and Naren Ramakrishnan. "Unsupervised spatial event detection in targeted domains with applications to civil unrest modeling." PloS one 9, no. 10 (2014): e110206.

7. Naren Ramakrishnan, Patrick Butler, Sathappan Muthiah, Nathan Self, Rupinder Khandpur, Parang Saraf, Wei Wang, Jose Cadena, Anil Vullikanti, Gizem Korkmaz, Chris Kuhlman, Achla Marathe, Liang Zhao, Ting Hua, Feng Chen, et al. "'Beating the news' with EMBERS: forecasting civil unrest using open source indicators." In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1799-1808. ACM, 2014.

8. Ting Hua, Chang-Tien Lu, Naren Ramakrishnan, Feng Chen, J. Arredondo, David Mares, and K. Summers. “Analyzing Civil Unrest through Social Media.” IEEE Computer, Vol. 46, No. 12, pp. 82-86, 2013.

9. Ting Hua, Feng Chen, Liang Zhao, Chang-Tien Lu, and Naren Ramakrishnan. "STED: semi-supervised targeted-interest event detection in Twitter." In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1466-1469. ACM, 2013.

7.3.2 Submitted and In-preparation papers

1. Ting Hua, Chandan Reddy, Jaegul Choo, and Chang-Tien Lu. “Common and Distinctive Learning via Bayesian Inference.” ACM Transactions on Knowledge Discovery from Data (TKDD), ACM. (Submitted)

2. Ting Hua, Liang Zhao, Lei Zhang, Chandan Reddy, Chang-Tien Lu, and Naren Ramakrishnan. “Seeding epidemic simulation with updates from social media data.” ACM Transactions on Intelligent Systems and Technology (TIST). ACM, 2017. (In preparation)

3. Ting Hua, Chandan Reddy, Chang-Tien Lu, and Naren Ramakrishnan. “Generative adversarial imitation learning for storyline generation.” In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2018). ACM, 2018. (In preparation)

Chapter 8

References

Bibliography

[1] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu. Predicting flu trends using twitter data. In Computer Communications Workshops (INFOCOM WKSHPS), pages 702–707. IEEE, 2011.

[2] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu. Online social networks flu trend tracker: a novel sensory approach to predict flu trends. In Proceedings of the 5th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC), pages 353–368. Springer, 2012.

[3] C. L. Barrett, R. J. Beckman, M. Khan, V. Anil Kumar, M. V. Marathe, P. E. Stretz, T. Dutta, and B. Lewis. Generation and analysis of large synthetic social contact networks. In Proceedings of the 41st Winter Simulation Conference (WSC), pages 1003–1014. Winter Simulation Conference, 2009.

[4] C. L. Barrett, K. R. Bisset, S. G. Eubank, X. Feng, and M. V. Marathe. Episimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks. In Proceedings of the 22nd ACM/IEEE conference on Supercomputing (ICS), pages 1–12. IEEE, 2008.

[5] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event identification on twitter. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, pages 438–441. AAAI, 2011.

[6] R. Beckman, K. R. Bisset, J. Chen, B. Lewis, M. Marathe, and P. Stretz. Isis: A networked-epidemiology based pervasive web app for infectious disease pandemic planning and response. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1847–1856. ACM, 2014.

[7] I. Bhattacharya. Google trends for formulating gis mapping of disease outbreaks in india. In International Journal of Geoinformatics, volume 9. Springer, 2013.

[8] K. R. Bisset, J. Chen, X. Feng, V. Kumar, and M. V. Marathe. Epifast: a fast algorithm for large scale realistic epidemic simulations on distributed memory systems. In Proceedings of the 23rd international conference on Supercomputing (ICS), pages 430–439. ACM, 2009.


[9] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120. ACM, 2006.

[10] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

[11] R. D. Bock and M. Aitkin. Marginal maximum likelihood estimation of item parameters: Application of an em algorithm. volume 46, pages 443–459. Springer, 1981.

[12] T. Brants, F. Chen, and A. Farahat. A system for new event detection. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 330–337. ACM, 2003.

[13] S. Brennan, A. Sadilek, and H. Kautz. Towards understanding global spread of disease from everyday interpersonal interactions. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), pages 2783–2789. AAAI, 2013.

[14] D. Cai, X. He, X. Wu, and J. Han. Non-negative matrix factorization on manifold. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on Data Mining, pages 63–72. IEEE, 2008.

[15] G. Casella and E. I. George. Explaining the gibbs sampler. volume 46, pages 167–174. Taylor & Francis, 1992.

[16] M. Cataldi, L. Di Caro, and C. Schifanella. Emerging topic detection on twitter based on temporal and social terms evaluation. In Proceedings of the 10th International Workshop on Multimedia Data Mining, pages 1–10. ACM, 2010.

[17] J. Chang, J. Boyd-Graber, and D. M. Blei. Connections between the lines: augmenting social networks with text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178. ACM, 2009.

[18] C. Chemudugunta, P. Smyth, and M. Steyvers. Modeling general and specific aspects of documents with a probabilistic topic model. In Proceedings of Neural Information Processing Systems (NIPS), volume 19, pages 241–248, 2006.

[19] L. Chen, K. T. Hossain, P. Butler, N. Ramakrishnan, and B. A. Prakash. Flu gone viral: Syndromic surveillance of flu on twitter using temporal topic models. In Proceedings of the 14th IEEE International Conference on Data Mining (ICDM), pages 755–760. IEEE, 2014.

[20] B. Christopher. Pattern recognition and machine learning. pages 93–94. Springer, New York, 2007.

[21] D. A. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems (NIPS), pages 430–436, 2001.

[22] A. Culotta. Towards detecting influenza epidemics by analyzing twitter messages. In Proceedings of the First Workshop on Social Media Analytics (SOMA), pages 115–122. ACM, 2010.

[23] S. Eubank, H. Guclu, V. A. Kumar, M. V. Marathe, A. Srinivasan, Z. Toroczkai, and N. Wang. Modelling disease outbreaks in realistic urban social networks. volume 429, pages 180–184. Nature Publishing Group, 2004.

[24] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu. Parameter free bursty events detection in text streams. In Proceedings of the 31st international conference on Very large data bases, pages 181–192. VLDB Endowment, 2005.

[25] T. Ge, W. Pei, H. Ji, S. Li, B. Chang, and Z. Sui. Bring you to the past: Automatic generation of topically relevant event chronicles. In Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), pages 575–585, 2015.

[26] M. P. Girard, J. S. Tam, O. M. Assossou, and M. P. Kieny. The 2009 a (h1n1) influenza virus pandemic: A review. volume 28, pages 4895–4902. Elsevier, 2010.

[27] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

[28] C. Groendyke, D. Welch, and D. R. Hunter. A network-based analysis of the 1861 hagelloch measles data. volume 68, pages 755–765. Wiley Online Library, 2012.

[29] G. Heinrich. Parameter estimation for text analysis. Technical report, 2005.

[30] G. Heinrich. Parameter estimation for text analysis. In University of Leipzig, Tech. Rep, 2008.

[31] H. Hirose and L. Wang. Prediction of infectious disease spread using twitter: A case of influenza. In Proceedings of the 55th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), pages 100–105. IEEE, 2012.

[32] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), pages 856–864, 2010.

[33] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. volume 14, pages 1303–1347, 2013.

[34] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the international ACM conference on Research and development in information retrieval (SIGIR), pages 50–57. ACM, 1999.

[35] L. Hong and B. D. Davison. Empirical study of topic modeling in twitter. In Proceedings of the First Workshop on Social Media Analytics, pages 80–88. ACM, 2010.

[36] L. Hong, B. Dom, S. Gurumurthy, and K. Tsioutsiouliklis. A time-dependent topic model for multiple text streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 832–840. ACM, 2011.

[37] M. Hu, S. Liu, F. Wei, Y. Wu, J. Stasko, and K.-L. Ma. Breaking news on twitter. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2751–2754. ACM, 2012.

[38] Y. Hu, A. John, F. Wang, and S. Kambhampati. Et-lda: Joint topic modeling for aligning events and their twitter feedback. In AAAI, volume 12, pages 59–65, 2012.

[39] T. Hua, F. Chen, L. Zhao, C.-T. Lu, and N. Ramakrishnan. STED: semi-supervised targeted-interest event detection in twitter. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1466–1469. ACM, 2013.

[40] T. Hua, C.-T. Lu, N. Ramakrishnan, F. Chen, J. Arredondo, D. Mares, and K. Summers. Analyzing civil unrest through social media. Computer, 46(12):80–84, 2013.

[41] T. Hua, N. Yue, F. Chen, C.-T. Lu, and N. Ramakrishnan. Topical analysis of interactions between news and social media. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 2964–2971, 2016.

[42] L. Huang and L. Huang. Optimized event storyline generation based on mixture-event-aspect model. In Proceedings of the 12th Conference on Empirical Methods in Natural Language Processing, pages 726–735, 2013.

[43] O. Jin, N. N. Liu, K. Zhao, Y. Yu, and Q. Yang. Transferring topical knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 775–784. ACM, 2011.

[44] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142. Springer, 1998.

[45] A. M. Kaplan and M. Haenlein. The early bird catches the news: Nine things you should know about micro-blogging. volume 54, pages 105–113. Elsevier, 2011.

[46] E. H. Kaplan, D. L. Craft, and L. M. Wein. Emergency response to a smallpox attack: the case for mass vaccination. volume 99, pages 10935–10940. National Acad Sciences, 2002.

[47] N. Kawamae. Trend analysis model: trend consists of temporal words, topics, and timestamps. In Proceedings of the 4th ACM international conference on Web search and data mining, pages 317–326. ACM, 2011.

[48] H. Kim, J. Choo, J. Kim, C. K. Reddy, and H. Park. Simultaneous discovery of common and discriminative topics via joint nonnegative matrix factorization. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 567–576. ACM, 2015.

[49] M. Krieck, J. Dreesman, L. Otrusina, and K. Denecke. A new age of public health: Identifying disease outbreaks by analyzing tweets. In Proceedings of the Health Web-Science Workshop, ACM Web Science Conference, 2011.

[50] H. W. Kuhn. The hungarian method for the assignment problem. In Naval research logistics quarterly (NRL), volume 2, pages 83–97. Wiley Online Library, 1955.

[51] M. Kulldorff. Spatial scan statistics: models, calculations, and applications. In Scan Statistics and Applications, pages 303–322. Springer, 1999.

[52] G. Kumaran and J. Allan. Text classification and named entities for new event detection. In Proceedings of the 27th annual ACM SIGIR Conference on Research and development in information retrieval, pages 297–304. ACM, 2004.

[53] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM, 2010.

[54] S. Lacoste-Julien, F. Sha, and M. I. Jordan. Disclda: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems (NIPS), pages 897–904, 2009.

[55] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright. Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM Journal on Optimization, volume 9, pages 112–147. SIAM, 1998.

[56] T. Lappas, B. Arai, M. Platakis, D. Kotsakos, and D. Gunopulos. On burstiness-aware search for document sequences. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 477–486. ACM, 2009.

[57] T. Lappas, M. R. Vieira, D. Gunopulos, and V. J. Tsotras. On the spatiotemporal burstiness of terms. In Proceedings of the VLDB Endowment, volume 5, pages 836–847. VLDB Endowment, 2012.

[58] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in neural information processing systems (NIPS), pages 556–562, 2001.

[59] J. Lehmann, B. Gonçalves, J. J. Ramasco, and C. Cattuto. Dynamical classes of collective attention in twitter. In Proceedings of the 21st International Conference on World Wide Web, pages 251–260. ACM, 2012.

[60] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497–506. ACM, 2009.

[61] C. Leys, C. Ley, O. Klein, P. Bernard, and L. Licata. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, volume 49, pages 764–766. Elsevier, 2013.

[62] C. Li, A. Sun, and A. Datta. Twevent: Segment-based event detection from tweets. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 155–164. ACM, 2012.

[63] J. Li and S. Li. Evolutionary hierarchical dirichlet process for timeline summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 556–560, 2013.

[64] R. Li, K. H. Lei, R. Khadiwala, and K. C.-C. Chang. Tedas: A twitter-based event detection and analysis system. In Proceedings of the 28th International Conference on Data Engineering, pages 1273–1276. IEEE, 2012.

[65] C. Lin, C. Lin, J. Li, D. Wang, Y. Chen, and T. Li. Generating event storylines from microblogs. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 175–184. ACM, 2012.

[66] S. Lin, F. Wang, Q. Hu, and P. S. Yu. Extracting social events for learning better information diffusion models. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 365–373. ACM, 2013.

[67] T. Lin, W. Tian, Q. Mei, and H. Cheng. The dual-sparse topic model: mining focused topics and focused terms in short text. In Proceedings of the 23rd international conference on World wide web, pages 539–550. International World Wide Web Conferences Steering Committee, 2014.

[68] Y.-R. Lin, D. Margolin, B. Keegan, and D. Lazer. Voices of victory: A computational focus group framework for tracking opinion shift in real time. In Proceedings of the 22nd International Conference on World Wide Web, pages 737–748. International World Wide Web Conferences Steering Committee, 2013.

[69] Z. Ma, A. Sun, Q. Yuan, and G. Cong. Tagging your tweets: A probabilistic modeling of hashtag annotation in twitter. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 999–1008. ACM, 2014.

[70] C. Macdonald, R. McCreadie, M. Osborne, I. Ounis, S. Petrović, and L. Shrimpton. Can twitter replace newswire for breaking news? In Proceedings of the 7th International AAAI Conference on Weblogs and Social Media, 2013.

[71] M. E. J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, volume 69. APS, 2004.

[72] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, volume 103, pages 8577–8582, 2006.

[73] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: An exploration of temporal text mining. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 198–207. ACM, 2005.

[74] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 262–272. ACL, 2011.

[75] B. Min, R. Grishman, L. Wan, C. Wang, and D. Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In HLT-NAACL, pages 777–782. ACL, 2013.

[76] J. D. Murray. Mathematical biology I: an introduction. Interdisciplinary Applied Mathematics, volume 17. Springer, 2002.

[77] S. Muthiah, B. Huang, J. Arredondo, D. Mares, L. Getoor, G. Katz, and N. Ramakrishnan. Planned protest modeling in news and social media. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 3920–3927. AAAI, 2015.

[78] D. B. Neill. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), volume 74, pages 337–360. Wiley Online Library, 2012.

[79] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, volume 7, pages 308–313. Oxford University Press, 1965.

[80] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan. Nested hierarchical dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), volume 37, pages 256–270. IEEE, 2015.

[81] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, volume 22, pages 1345–1359. IEEE, 2010.

[82] M. Paul and R. Girju. A two-dimensional topic-aspect model for discovering multi-faceted topics. In Association for the Advancement of Artificial Intelligence (AAAI), volume 51, page 36, 2010.

[83] M. J. Paul and M. Dredze. A model for mining public health topics from twitter. volume 11, pages 16–6, 2012.

[84] S. Petrović, M. Osborne, and V. Lavrenko. Streaming first story detection with application to twitter. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 181–189. ACL, 2010.

[85] X.-H. Phan, C.-T. Nguyen, D.-T. Le, L.-M. Nguyen, S. Horiguchi, and Q.-T. Ha. A hidden topic-based framework toward building applications with short web documents. IEEE Transactions on Knowledge and Data Engineering, 23(7):961–976, 2011.

[86] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on World Wide Web, pages 91–100. ACM, 2008.

[87] A.-M. Popescu, M. Pennacchiotti, and D. Paranjpe. Extracting events and event descriptions from twitter. In Proceedings of the 20th International Conference Companion on World Wide Web, pages 105–106. ACM, 2011.

[88] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed gibbs sampling for latent dirichlet allocation. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 569–577. ACM, 2008.

[89] M. Purver and S. Battersby. Experimenting with distant supervision for emotion classification. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–491. ACL, 2012.

[90] A. K. Qin, V. L. Huang, and P. N. Suganthan. Differential evolution algorithm with strategy adaptation for global numerical optimization. IEEE Transactions on Evolutionary Computation, volume 13, pages 398–417. IEEE, 2009.

[91] M. Rabinovich and D. M. Blei. The inverse regression topic model. In Proceedings of International Conference on Machine Learning (ICML), pages 199–207, 2014.

[92] D. Ramage, S. T. Dumais, and D. J. Liebling. Characterizing microblogs with topic models. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, pages 130–137, 2010.

[93] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of Conference on Empirical Methods on Natural Language Processing (EMNLP), pages 248–256. ACL, 2009.

[94] D. Ramage, C. D. Manning, and S. Dumais. Partially labeled topic models for interpretable text mining. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 457–465. ACM, 2011.

[95] A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. ACL, 2011.

[96] A. Ritter, Mausam, O. Etzioni, and S. Clark. Open domain event extraction from twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1104–1112. ACM, 2012.

[97] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the 20th conference on Uncertainty in artificial intelligence (UAI), pages 487–494. AUAI Press, 2004.

[98] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860. ACM, 2010.

[99] B. Settles. Active learning literature survey. University of Wisconsin, Madison, volume 52, page 11, 2010.

[100] D. Shahaf, J. Yang, C. Suen, J. Jacobs, H. Wang, and J. Leskovec. Information cartography: creating zoomable, large-scale maps of information. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1097–1105. ACM, 2013.

[101] L. Shou, Z. Wang, K. Chen, and G. Chen. Sumblr: continuous summarization of evolving tweet streams. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 533–542. ACM, 2013.

[102] A. Signorini, A. M. Segre, and P. M. Polgreen. The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic. PloS one, volume 6, page e19467. Public Library of Science, 2011.

[103] H. F. Silver. Compare & contrast: Teaching comparative thinking to strengthen student learning. pages 1–2. Association for Supervision & Curriculum Development, 2010.

[104] T. Štajner, B. Thomee, A.-M. Popescu, M. Pennacchiotti, and A. Jaimes. Automatic selection of social media responses to news. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 50–58. ACM, 2013.

[105] K. Stevens, P. Kegelmeyer, D. Andrzejewski, and D. Buttler. Exploring topic coherence over many models and many topics. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing (EMNLP) and Computational Natural Language Learning (CoNLL), pages 952–961. ACL, 2012.

[106] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 306–315. ACM, 2004.

[107] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Sharing clusters among related groups: Hierarchical dirichlet processes. In Advances in neural information processing systems (NIPS), pages 1385–1392, 2005.

[108] M. Tsagkias, M. de Rijke, and W. Weerkamp. Linking online news and social media. In Proceedings of the fourth ACM International Conference on Web Search and Data Mining, pages 565–574. ACM, 2011.

[109] M. Tsytsarau, T. Palpanas, and M. Castellanos. Dynamics of news events and social media reaction. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 901–910. ACM, 2014.

[110] Z. Tufekci and C. Wilson. Social media and the decision to participate in political protest: Observations from Tahrir Square. Journal of Communication, volume 62, pages 363–379. Wiley Online Library, 2012.

[111] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, pages 178–185, 2010.

[112] J. Vosecky, D. Jiang, K. W.-T. Leung, and W. Ng. Dynamic multi-faceted topic discovery in twitter. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 879–884. ACM, 2013.

[113] E. Vynnycky and R. White. An introduction to infectious disease modelling. Oxford University Press, 2010.

[114] H. M. Walker. Studies in the history of the statistical method. pages 24–25. The Williams and Wilkins Company, 1931.

[115] J. Wang, W. Tong, H. Yu, M. Li, X. Ma, H. Cai, T. Hanratty, and J. Han. Mining multi-aspect reflection of news events in twitter: Discovery, linking and presentation. In Proceedings of IEEE International Conference on Data Mining (ICDM), pages 429–438. IEEE, 2015.

[116] X. Wang and A. McCallum. Topics over time: a non-markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 424–433. ACM, 2006.

[117] M. Welling and Y. W. Teh. Hybrid variational/gibbs collapsed inference in topic models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 587–594, 2008.

[118] J. Weng and B.-S. Lee. Event detection in twitter. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, pages 401–408. AAAI, 2011.

[119] C. Wilson and A. Dunn. Digital media in the Egyptian revolution: Descriptive analysis from the Tahrir data sets. International Journal of Communication, volume 5, pages 1248–1272. USC Annenberg Press, 2011.

[120] S. J. Wright and J. Nocedal. Numerical optimization, volume 35. Springer Science, 1999.

[121] R. Yan, X. Wan, J. Otterbacher, L. Kong, X. Li, and Y. Zhang. Evolutionary timeline summarization: a balanced optimization framework via iterative substitution. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 745–754. ACM, 2011.

[122] S.-H. Yang, A. Kolcz, A. Schlaikjer, and P. Gupta. Large-scale high-precision topic modeling on twitter. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1907–1916. ACM, 2014.

[123] Z. Yin, L. Cao, J. Han, C. Zhai, and T. Huang. Geographical topic discovery and comparison. In Proceedings of the 20th international conference on World wide web, pages 247–256. ACM, 2011.

[124] D. Zhang, Y. Liu, R. D. Lawrence, and V. Chenthamarakshan. Transfer latent semantic learning: Microblog mining with less supervision. In Proceedings of the 25th AAAI Conference on Artificial Intelligence, pages 561–566. AAAI, 2011.

[125] L. Zhao, T. Hua, C.-T. Lu, and R. Chen. A topic-focused trust model for twitter. pages 1–11. Elsevier, 2015.

[126] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing twitter and traditional media using topic models. In Advances in Information Retrieval, pages 338–349. Springer, 2011.

[127] D. Zhou, H. Xu, and Y. He. An unsupervised bayesian modelling approach to storyline detection from news articles. In Proceedings of the 14th Conference on Empirical Methods in Natural Language Processing, pages 1943–1948, 2015.