Identification and Characterization of Events in Social Media
Hila Becker

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2011

© 2011 Hila Becker
All Rights Reserved

ABSTRACT

Millions of users share their experiences, thoughts, and interests online, through social media sites (e.g., Twitter, Flickr, YouTube). As a result, these sites host a substantial number of user-contributed documents (e.g., textual messages, photographs, videos) for a wide variety of events (e.g., concerts, political demonstrations, earthquakes). In this dissertation, we present techniques for leveraging the wealth of available social media documents to identify and characterize events of different types and scale. By automatically identifying and characterizing events and their associated user-contributed social media documents, we can ultimately offer substantial improvements in browsing and search quality for event content.

To understand the types of events that exist in social media, we first characterize a large set of events using their associated social media documents. Specifically, we develop a taxonomy of events in social media, identify important dimensions along which they can be categorized, and determine the key distinguishing features that can be derived from their associated documents. We quantitatively examine the computed features for different categories of events, and establish that significant differences can be detected across categories. Importantly, we observe differences between events and other non-event content that exists in social media. We use these observations to inform our event identification techniques.

To identify events in social media, we follow two possible scenarios. In one scenario, we do not have any information about the events that are reflected in the data.
In this scenario, we use an online clustering framework to identify these unknown events and their associated social media documents. To distinguish between event and non-event content, we develop event classification techniques that rely on a rich family of aggregate cluster statistics, including temporal, social, topical, and platform-centric characteristics. In addition, to tailor the clustering framework to the social media domain, we develop similarity metric learning techniques for social media documents, exploiting the variety of document context features, both textual and non-textual.

In our alternative event identification scenario, the events of interest are known, through user-contributed event aggregation platforms (e.g., Last.fm events, EventBrite, Facebook events). In this scenario, we can identify social media documents for the known events by exploiting known event features, such as the event title, venue, and time. While this event information is generally helpful and easy to collect, it is often noisy and ambiguous. To address this challenge, we develop query formulation strategies for retrieving event content on different social media sites. Specifically, we propose a two-step query formulation approach, with a first step that uses highly specific queries aimed at achieving high-precision results, and a second step that builds on these high-precision results, using term extraction and frequency analysis, with the goal of improving recall. Importantly, we demonstrate how event-related documents from one social media site can be used to enhance the identification of documents for the event on another social media site, thus contributing to the diversity of information that we identify.

The number of social media documents that our techniques identify for each event is potentially large.
To avoid overwhelming users with unmanageable volumes of event information, we design techniques for selecting a subset of documents from the total number of documents that we identify for each event. Specifically, we aim to select high-quality, relevant documents that reflect useful event information. For this content selection task, we experiment with several centrality-based techniques that consider the similarity of each event-related document to the central theme of its associated event and to other social media documents that correspond to the same event. We then evaluate both the relative and overall user satisfaction with the selected social media documents for each event.

The existing tools to find and organize social media event content are extremely limited. This dissertation presents robust ways to organize and filter this noisy but powerful event information. With our event identification, characterization, and content selection techniques, we provide new opportunities for exploring and interacting with a diverse set of social media documents that reflect timely and revealing event content. Overall, the work presented in this dissertation provides an essential methodology for organizing social media documents that reflect event information, towards improved browsing and search for social media event data.

Table of Contents

1 Introduction
2 Event Definition and Characterization
    2.1 Events in the Literature
        2.1.1 Topic Detection and Tracking
        2.1.2 Event Extraction
        2.1.3 Multimedia Event Detection
    2.2 Related Concepts: Topics, Trends, and Activities
    2.3 Events in Social Media
3 Characterization of Trending Events in Social Media
    3.1 Background: Twitter
    3.2 Trends on Twitter
    3.3 Collecting Trend Data
        3.3.1 Tweets Dataset
        3.3.2 Selecting Trends for Analysis
        3.3.3 Identifying Tweets Associated with Trends
    3.4 Trend Taxonomy and Dimensions
    3.5 Characterization of Trends and Events
    3.6 Categorizing Trends in Different Dimensions
    3.7 Quantitative Analysis
    3.8 Experimental Results
        3.8.1 Exogenous vs. Endogenous Trends
        3.8.2 Breaking News vs. Other Exogenous Trends
        3.8.3 Local Events vs. Other Exogenous Trends
        3.8.4 Memes vs. Retweet Endogenous Trends
    3.9 Discussion
    3.10 Conclusions
4 Identification of Unknown Events and Their Content
    4.1 Clustering Framework
    4.2 Separation of Event and non-Event Content on Twitter
        4.2.1 Identification of Event Clusters
        4.2.2 Cluster-Level Event Features
        4.2.3 Event Classification
        4.2.4 Experiments
    4.3 Conclusions
5 Similarity Metric Learning for Identification of Unknown Events
    5.1 Learning Similarity Metrics for Clustering
        5.1.1 Social Media Document Representations
        5.1.2 Clustering Quality Metrics and Parameter Settings
        5.1.3 Ensemble-based Similarity
        5.1.4 Classification-based Similarity
        5.1.5 Experiments
    5.2 Exploiting Social Links
        5.2.1 Link-based Similarity
        5.2.2 Exploratory Experiments
    5.3 Conclusions
6 Identification of Content for Known Events
    6.1 Motivation and Approach
    6.2 Precision-Oriented Query Building Strategies
    6.3 Recall-Oriented Query Building Strategies
    6.4 Leveraging Cross-Site Content
    6.5 Experiments
        6.5.1 Experimental Settings
        6.5.2 Experimental Results
    6.6 Event Tracking System
        6.6.1 Browser Plug-In
        6.6.2 Customizable Web-based Interface
    6.7 Conclusions
7 Selection of Event Content
    7.1 Identifying Event Content
        7.1.1 Content Selection Goals
        7.1.2 Content Selection Approaches
    7.2 Experiments
        7.2.1 Experimental Settings
        7.2.2 Experimental Results
    7.3 Conclusions
8 Related Work
    8.1 Event Identification in Textual News
    8.2 Trend Analysis in Social Media
    8.3 Event Identification in Social Media
    8.4 Large-Scale Data Clustering
    8.5 Social Media Content Summarization, Topic Discovery, and Analytics
9 Conclusions and Future Work
    9.1 Clustering Framework Optimization
    9.2 Identifying Unknown Events with Learned Similarity Metrics Across Sites
    9.3 Improving Breadth of Event Content
    9.4 Ranking Events for Search and Presentation
Bibliography
A Normalized Mutual Information and V-Measure

List of Figures

1.1 Comparison of the presence of non-dictionary terms in AP newswire versus Twitter documents.
2.1 Examples for the "Making a cake" event.
3.1 Trending terms, on the dark blue (middle) banner, on Twitter's home page (2010).
4.1 Conceptual diagram: Twitter event identification.
4.2 Documents per hour with the term "valentine" for 72 hours prior to 2 p.m. on Valentine's Day.
4.3 Examples of social interaction on Twitter.
4.4 Distribution of labels for our classifier and baselines.
4.5 Precision@K for our classifier and baselines.
4.6 NDCG@K for our classifier and baselines.
4.7 NDCG@20 of our classifier and baselines for each hour over the test set.
5.1 A Flickr photograph associated with the "All Points West" music festival event.
5.2 A conceptual diagram of an ensemble clustering process.
5.3 NMI scores on the Upcoming test dataset.
5.4 NMI scores on the Last.fm test dataset.
5.5 B-Cubed scores on the Upcoming test dataset.
5.6 B-Cubed scores on the Last.fm test dataset.
5.7 Comparison of all techniques using the Nemenyi test. Groups of techniques connected by a line are not significantly different at p < 0.05.
5.8 Homogeneity scores on the Upcoming test dataset.
5.9 Completeness scores on the Upcoming test dataset.
5.10 Comments between authors in two event clusters.
5.11 B-Cubed Precision scores on the Upcoming test dataset.
5.12 B-Cubed Recall scores on the Upcoming test dataset.
6.1 A Last.fm event record for the "Celebrate Brooklyn!" opening night gala and concert.