
Labeling, Analyzing, and Simulating Tweets Produced in Disasters

Peter Maciunas Landwehr
Computation, Organizations, and Society
Institute for Software Research
School of Computer Science
Carnegie Mellon University
[email protected]

September 30, 2013

Abstract

Relief organizations use social media to stay abreast of victims' needs during disaster and to help in planning relief efforts. Organizations hire dedicated teams to find posts that contain substantive, actionable information. Simulation and training exercises play important roles in preparing for disaster; tools like HAZUS-MH and FlamMap help to predict the physical damages and costs that will result when hurricanes and wildfires strike. Yet few if any disaster simulations incorporate the ways that a disaster can cause an outpouring of comment and useful information on the web. In disasters such as the 2010 Haiti Earthquake, tweets and SMS messages have been translated and coded using crowdsourcing platforms to divide up labor. Machine learning has been used to parse tweets for keywords and then to intuit earthquake epicenters from posters' physical positions. Researchers have developed several different coding schemes for tweets to identify different types of useful content. While unskilled coders can apply some schemes, others require nuanced, subjective judgments. I propose to address the problems posed by the dearth of training simulations and the difficulties of applying salient codings to tweets by developing a novel simulation of social media in disaster, experimenting with using crowdsourcing to apply a set of sophisticated labels, and using the coded data to train a machine learning classifier. The new simulation will produce sequences of "pseudo-tweets" that consist of content labels and a collection of common features, such as hashtags and retweet indicators, as output. It will be validated against longitudinal analyses of tweets produced during Hurricane Sandy and the 2012 Colorado Wildfires, and can be incorporated into future disaster preparation tasks. Prior to analysis, a subset of tweets from each disaster will be coded with a sophisticated labeling scheme developed by Vieweg, using both crowd workers and trained linguistics students, with the results of each group's coding compared against the other. The gold-standard data will be used to train a machine learning classifier for applying the codings in the future.

Committee:
Dr. Kathleen M. Carley (Chair)
Dr. Jason Hong
Dr. Jürgen Pfeffer
Dr. Sarah Vieweg

1 Introduction: Problem Domain and Research Field

Social media platforms are an established part of society. Their users continue to post content even during the immediate onset of disaster. Relief organizations have begun to see these posts (and the tendencies driving these posting practices) as something that can be leveraged to support their various functions in disaster. During Hurricane Sandy, for example, many people afflicted by the storm posted continuous updates to Twitter [1]. Individuals living in the area afflicted by the 2008 spill of coal ash into the Tennessee River attempted to use Twitter to increase public awareness of the event [2]. Relief groups have correspondingly begun turning to Twitter as a source of additional information about disasters as they transpire [3], [4].

The central difficulty of working with social media is the large volume of noise present on the platform. While good search term choice can help to limit the amount of noise present in the system, locating relevant terms remains a difficult problem. One approach is that of Sakaki et al. [5], who used the distribution of different earthquake-related keywords across Japan to predict the epicenters of earthquakes. Similarly, Asur and Huberman have used the relative frequencies of mentions of particular movies on Twitter to predict weekend box office performance [6]. Another group of researchers has looked at different ways that tweets can be categorized and grouped on the assumption that they may contain additional useful information. These categorizations have ranged from the moderately interpreted (Naaman et al.'s "sharing information") to the very subjective but likely useful (Verma et al.'s "situational awareness" label) [7], [8]. Similarly, both researchers and the private sector are building tools to help relief groups rapidly filter social media data for relevant information [9]–[11].

While relief groups are working with Twitter data during real disasters in order to understand the situation on the ground, simulation has long been understood to be a critical component of preparing for disasters [12]. Yet despite knowing that social media will play a role in any disaster response, little attention has been given to reproducing the form and content of the relief messages produced on social media. One exception was a live simulation of a disaster carried out at Arizona State University in 2011, in which students were recruited to produce proxy tweets [13]. However, the simulation did not incorporate mechanisms to guarantee that students would produce tweets in a manner resembling the true distribution of tweets in disaster.

In this thesis, I will seek to address the dearth of training simulations for social media by developing a simulation that produces pseudo-tweets, each of which contains no text but includes a set of representative tweet features and a particular mix of category labels. In order to create data that can be used with the simulation, I will categorize a subset of two corpora of tweets using labels developed by Vieweg. These labeled tweets will be used as training data for a classifier to automatically code all tweets in the corpora that were not labeled in the first phase. Once categorized, all of the labeled tweets will be used as the basis for a general analysis of the two disasters as represented on Twitter. This analysis and the distributions of the different labels will then be used to give the simulation a basis in reality and will serve as the basis for a face validation of its results.

2 Research Questions and Contributions

This thesis will make four contributions, two secondary and two primary. Three of these contributions (crowdsourcing a sophisticated tweet coding scheme, developing a new classifier for tweets, and conducting a longitudinal analysis of tweets made during disasters) are intended to support the fourth goal, building a simulation of the tweets produced over the course of two different kinds of disasters. The contributions can each be thought of as addressing one of three research questions:

• How can we inexpensively and rapidly classify tweets so that relief workers can use them?
• How are disasters represented in the content of social media?
• How can we usefully reproduce the content produced on social media during disasters?


With these research questions as a basis, the single thesis statement for this work would be: "In order to aid first responders in dealing with Twitter data gleaned during disaster, I propose to experiment with how to efficiently and meaningfully label tweets, to analyze how tweets produced in disaster relate to the events occurring in the disaster, and to develop representative simulations of the tweets produced in a disaster situation." Below, the four contributions are discussed in more detail.

2.1 Crowdsourcing a Sophisticated Coding Scheme

When creating her own gold-standard data, Vieweg trained two linguistics graduate students in the nuances of her coding scheme and acted as an arbitrator for their decisions [14]. This process took time and effort, and cannot be applied to live data collected during a real disaster. As Vieweg herself mapped out how tweets with particular codes could contain data of interest to relief groups, it would be useful if her coding could be applied during the course of a disaster as data arrives. Faster labeling would play an important role in getting additional information to authorities sooner, delivering aid more quickly, and reducing the costs of providing relief. In this thesis, I will attempt to use workers on a crowdsourcing platform to apply her coding scheme to tweets. Crowd workers have already played crucial roles coding data in other disasters, notably in the 2010 Haiti Earthquake [15]. The central challenge of this task will be converting Vieweg's instructions into training that can be used to guarantee that crowd workers will perform quality coding despite lacking the same background training.

2.2 Developing a New Classifier for Tweets

While successfully crowdsourcing the labeling of tweets will decrease the time and expense of coding, developing a functioning machine learning classifier for applying those same labels would have a more dramatic effect, whether used alone or in tandem with crowdsourcing. Vieweg proposes and tests one theoretical model for a classifier in her thesis, and Verma et al. developed a very effective classifier for "situational awareness", the top-level category in Vieweg's hierarchy of labels [8]. A new classifier that can effectively apply Vieweg's labels and categories to tweets that have been identified as possessing situational awareness will further decrease the cost of processing tweets while also pointing the way toward a system for on-line coding of tweets during a disaster. Just as Vieweg's labels exist in a hierarchy, it is likely that classification will be done using a set of classifiers to label tweets as containing situational awareness information, one or more categories, and a mix of category labels. Verma et al. were highly successful at training a situational awareness classifier for disaster data, and I hope to build on their success [8]. Given the number of labels developed by Vieweg and their uneven distribution in different disasters, it is conceivable that it won't be possible to train a classifier for every label, but rather just for those present in my data. In that case, I will use crowd workers or linguistics students to code more tweets and will define success around classifying only the labels present in the data.

2.3 Longitudinal Analysis of Salient Tweet Features During Two Disasters

The goal of the longitudinal analysis is to characterize how the frequencies of particular categories of tweets, as well as their salient features, change in distribution over the course of a disaster and in response to particular events during a disaster. It is known that individuals who experience power outages, flooding, and other disaster-related events often comment on their existence immediately; this fact has been leveraged by Sakaki et al. to approximate the locations of earthquake epicenters in Japan and has been compared with US Geological Survey reporting by Earle et al. [5], [16]. De Longueville et al. had mixed success attempting to approximate a forest fire's extent using tweets [17]. However, it is unclear how the patterns and categories of tweets reporting particular types of suffering during a disaster change and multiply as the disaster continues, ends, and moves on into its aftermath. Some similar work has occurred: Sato et al. discuss initial approaches to linguistically analyzing the tweets collected in a corpus during a disaster, while Kongthon et al. looked at the changes in types of tweets sent over time during a 2011 Thailand flood [18], [19]. (Limited information about coding assignment or validity is incorporated in the latter study.) The aforementioned work by De Longueville et al. compares the frequency of collected tweets with critical events in the forest fire's growth and decay. Vieweg's categories, however, provide a precise lens for viewing disasters that has not been used in this type of study, as well as a framing for connecting the needs expressed on Twitter to relief services.

2.4 Simulating the Salient Features of Tweets Produced During a Disaster

Simulations play an important role in disaster preparation [12]. Relief groups can use simulation exercises to help train for the approximate circumstances of a given disaster. In general, such simulations are best known for capturing the costs and types of physical damage that a disaster will wreak [20]. Few attempt to capture how individuals use social media to communicate their conditions in a disaster. My simulation will address this problem by focusing on reproducing the proximate contents of tweets that occur during different disasters, providing an approximation of the messages that relief workers will be likely to encounter. The production of tweets will be driven by user specification of the size and start of the disaster event. The simulation will then spawn a number of "pseudo-tweets" based on particular events determined to have occurred during the disaster. These "pseudo-tweets" will contain a variety of salient features seen in tweets, such as hashtags, retweet indication, mention counts, and type of user. They will be representative of those tweets using the most common, surface-level hashtags. For a training exercise, pseudo-tweets would be generated in advance, archived, and spooled out to individuals in charge of monitoring social media. The response team would then use the raw tweets to search for particular hashtags or users, look at what they say, and report to others simulating the role of relief groups. These responders would correspondingly resolve the reported incidents.

3 Background and Related Work

My literature review is divided into several distinct segments. In the first, I attempt to provide a characterization of Twitter: usage numbers, demographics, and some research findings about how the service tends to be used. In the second, I examine some of the literature on human behavior in disaster. In the third, I look at the role new media has played in disaster, with a specific emphasis on Twitter. In the fourth and fifth, I look at how simulations have contributed to disaster response and how they have been used to understand and reproduce information exchange and learning.

3.1 Twitter outside of Disasters

Twitter is one of the most well-known and researched microblogging platforms in the world. The service allows users to post 140-character messages that can be read by the entire public, read only by the other users that one has decided to "follow", or sent directly to a particular user who is following the sender. In addition to posting, users can follow others, seeing everything that they have chosen to post. All tweets can be searched using particular keywords or phrases, and the site actively maintains a list of "trending topics" indicating what content is the most important at the moment. Breaking through to the trending topics list is often seen as an indicator of having attained viral success. Users can also subdivide those who they follow into particular, dedicated lists in order to create dedicated channels for certain types of tweets.

Developers can download tweets from the so-called "firehose" by working with one of several data brokers with which the company has partnered. Twitter also grants developers free access to the "streaming API", which can be used to download up to 1% of all tweets at a given time. Morstatter et al. compared samples of tweets extracted from Twitter using the streaming API and the firehose when searching for tweets related to the Arab Spring [21]. They found that the streaming API failed to capture a representative distribution of tweets relative to the true distribution, hypothesizing that this was due to a global decrease in the total number of tweets: because the total number of tweets dropped, Twitter decreased the number of samples in order to remain at the 1% level.

Launched in 2006, Twitter has largely received positive media coverage [22]. Precise numbers about the size of the user base are difficult to ascertain. In March of 2011, Twitter reported that an average of 460,000 accounts were created each day during the past month [23]. An average of 140 million tweets were sent each day. In July of 2012, Semiocast projected that Twitter had passed 500 million total accounts in June [24]. In January of the same year, the company had calculated that only 27% of all accounts were actively updated, and that 141.8 million of their 517 million account sample were in the US [25]. This would suggest that approximately 38.2 million US accounts are active. For comparison, in 2012 GlobalWebIndex reported that there were 22.9 million active US Twitter users; they also reported that Twitter was more popular in China and India than in the U.S., something not seen in Semiocast's data [26]. In August of this year, Twitter reported normally seeing an average of 5,700 tweets per second [27]. According to the Pew Internet & American Life Project, in 2012 15% of online Americans used Twitter [28]. 26% of online 18-29 year olds used Twitter, as did 28% of online non-Hispanic Blacks. Both of these demographics have statistically significantly larger membership than other age brackets and races. Similarly, urban and suburban dwellers are statistically significantly more likely to use Twitter than are Americans living in rural areas.

In 2007, Java et al. performed a network analysis of the relationships between 87,897 users, all of whom had posted between April 1 and May 30 [29]. They fit the network to a power law distribution, loosely classified users based on network position as "information sources", "friends", and "information seekers", and argued that mutual following could be used to locate communities of users. In 2009, Kwak et al. reported crawling the entire Twittersphere, which at the time consisted of 41.7 million users [30]. The researchers found that the average path length between different users was only 4.12, which was surprisingly short given that only 22.1% of ties are reciprocated; they hypothesize that this may have to do with Twitter's nature as a broadcast medium (where users follow those they are unlikely to know in real life) rather than a network of social relations. Also in 2009, Naaman et al. analyzed 3,379 messages from 350 users and developed their own content-based categorization of users [7]. The researchers described one group of users as "me-formers" and another group as "informers": users who tend to talk about themselves and users who tend to link to other information. Bakshy et al. looked at the likelihood of seeing a particular URL propagated by a user's followers as a measure of influence [31]. While they were able to identify users who were influencers, the researchers were unable to locate any particular features that would guarantee that a particular influencer's tweet would result in a cascade of retweets. As such, they propose guaranteeing a tweet's circulation by getting it mentioned by a large number of un-influential users first, who may eventually cause an outbreak leading to a true cascade.

3.2 People and Organizations in Disasters

While popular media often depicts individuals in disasters as engaging in panicked flight behavior, this is incorrect: people in disaster rarely panic. Rather, they make rational decisions intended to guarantee their own safety and the safety of others. As described in 1970 by Dynes, "Panic is infrequent and does not occur on a mass scale. Disaster victims act positively, not irrationally or passively. Mutual help and self-help are frequent. Psychological disturbances do not render the impacted population helpless. Much of the individual rescue work is done by the victims who do not wait to be told what to do." [32] Indeed, the individuals most likely to rely on state services for relief are those who don't have other social contacts upon whom they can rely. It is the individuals in the area of the disaster who end up doing much of the early relief work, before aid agencies can arrive ([33] citing [34]). According to Dynes, assessing the damage that has been done is the "most significant initial task" facing the community.

Relief organizations active in a disaster area are intensely dependent on recovering new information. They often function in an ad-hoc manner, improvising the tasks needed to perform well [35]. The many organizations active on-site at any one time may not have any sort of shared situational awareness, which can make functioning in a hierarchy particularly difficult. In the aftermath of the 2010 Haiti earthquake, the various relief groups on the ground reported having difficulty working with each other [4]. Every organization appeared to bring its own software and data, and it was difficult for any one group to take charge.

3.3 New Media and Twitter's Role in Disasters

When individuals turn to new media in disaster, they do so to address established needs. For example, Shklovski et al. (2008) describe how a group of Californians felt that local media was failing to provide them with the information they needed about wildfire locations; to solve the problem, the Californians set up online forums where they could each contribute and report information [36]. Similarly, in the immediate aftermath of Hurricane Katrina, musicians in New Orleans began to adopt new technologies such as SMS texting and the web in order to stay in touch [37]. In both cases, current technologies were being leveraged in service of traditional needs to maintain safety and community. After the 2010 Haiti Earthquake, victims used SMS messaging to report their needs using the 4636 SMS short code; messages were aggregated and translated by local volunteers organized via a crowdsourcing platform [38]. Victims' needs weren't unconventional, but the combination of SMS messaging and crowdsourcing was.

Twitter fits into this pattern of being used by individuals to address established human needs. When Sarcevic et al. studied tweets by medical relief groups during the 2010 Haiti earthquake, they identified a behavior they referred to as "beaconing": these medical workers would post their needs and their current status to Twitter, as opposed to using a more directed or private channel [39]. Similarly, the media has covered accounts of how individuals in disaster have used Twitter as a platform to stay informed about others in the afflicted area [1], [40]. Mendoza et al., when looking at a small pool of verified true and false pieces of information propagated during the 2010 Chilean earthquake, found that false statements were often contested and rejected by the crowd while true ones were rarely contradicted [41]. That said, Twitter users failed to dispute general rumors about looting in Santiago, suggesting that it may be difficult to trust Twitter when it comes to vague, general rumors rather than specific sightings. Relatedly, Pfeffer et al. have looked at the particular risks posed by the periodic explosive growth of particular online stories in tense situations, and have attempted to characterize the features that make such growth possible [42].

Even as individuals reach out and describe what they experience in different crisis situations, a variety of research looking at how Twitter is used in disaster has found that only a small percentage of tweets contain any form of actionable data. Researchers studying tweets produced during fires, for example, have found that only a small percentage contain directly actionable information [43]–[45]. (Sinnapan et al., while finding that only 5% of tweets collected during Australia's Black Saturday fires contained directly actionable information, report that 22% of tweets contained some form of useful data.) Getting to this nugget of actionable information remains a core challenge of working with Twitter data. In the 2011 Shadow Lake Fire, for example, the Portland branch of the National Incident Management Organization (NIMO) used a dedicated group of volunteers known as a Virtual Operations Support Team (VOST) to keep track of useful social media activity [3]. Kris Eriksen, the head of the Portland branch, cited an example of the VOST finding a blog by a local explaining how fire trucks were badly routed through back roads, saying that such information seeking was "EXACTLY what I need a [VOST] to do." When looking at tweets collected during the Red River flood, Starbird et al. found that original tweets containing information about the flood made up less than 10% of the sample, but that 80% of those tweets were made by locals [45]. It is reasonable to contend that locality can at least be used as a rough first test for utility. Starbird et al. attempted to leverage this by training a support vector machine to identify members of Occupy Wall Street who were tweeting locally, albeit only achieving 67.9% accuracy [46]. Verma et al. proved more successful when attempting to solve the problem by using a trained Maximum Entropy classifier to identify tweets that exhibited "situational awareness" (SA), that is, tweets containing tactical, actionable information, in four different disasters and on a pseudo-disaster made of a uniform mix of random tweets taken from the other data. Using a combination of bigram features, unigram features, and classifiers for tone, subjectivity, and register, the researchers managed to achieve above 80% accuracy in labeling tweets as possessing or failing to possess SA. (Individual rates are not included.)

Vieweg, who contributed to the work with Verma, developed a set of 35 non-exclusive labels for the different kinds of data that can be found in SA tweets, working with both data and the literature on disasters [14]. Each label belongs to one of three categories identifying relevance to the social, built, or physical environment (S, B, or P; 24, 4, and 7 labels respectively), and is supported by a nuanced description. Working with VerbNet, a class-based verb lexicon, she found that the nine different classes of verbs described in VerbNet could be exclusively matched to one of the three category labels. In toto, of the tweets known to contain social, built, and physical environment information, 13.5%, 15.0%, and 30.3% respectively contained the referenced verbs.

3.4 Simulating Disaster

Two important reasons for simulating disaster situations are to estimate the amount of damage that a disaster will cause and to prepare for the coming scale of the disaster. As such, established simulation models for disaster tend to focus on calculating the physical damage that a particular disaster will do to an area, the dollar cost of such damage, and the amount of life that might be lost. That said, these models differ strongly in their internals because of the significantly different properties of the natural phenomena that they represent.

The US Forest Service's Missoula Research Station, for example, has released a collection of different models that take as input natural parameters including terrain, fuel, and weather and produce projections of how wildfires can be expected to grow. FARSITE and FlamMap are two models developed by this group that have been used by the Forest Service to find areas that may be damaged in controlled burns so that the burns can be kept from spreading, and to calculate the potential damage a wildfire might cause [47]–[49].


FEMA distributes HAZUS-MH, a free, comprehensive, customizable physical simulation of the damage inflicted by earthquakes, floods, and hurricanes [20], [50]. The model can be used with general default settings, or can take as input detailed models of the buildings in the region to be simulated, local soil data, economic information, and other elements; it estimates damages to human life, physical structures, and the creation of debris. In contrast, Florida's Public Hurricane Loss Model (PHLM) is an actuarial model primarily intended for estimating insurance losses [51]. The model uses historical records of insurance losses in Florida, records of hurricanes from previous years, and data about the ways in which particular structures take physical damage.

Both HAZUS and PHLM are composed of multiple individual models, used in tandem to predict the amount of damage that areas will see. These component damage models have been validated by comparisons with physical damage data collected from a number of hurricanes. Researchers also periodically validate HAZUS by comparing it with other disasters. For example, Remo and Pinter found that the HAZUS earthquake model significantly overestimated damage caused by the 2005 Mt. Carmel earthquake [52].

3.5 Simulating Information Transmission

There is relatively little work on specifically simulating the production of content in the digital medium. That said, a variety of research has been done looking at how we can reproduce the network structures underlying digital communities. Using data from a variety of communities, Leskovec et al. have defined a "forest fire" model that reproduces a variety of different online networks of different sizes, as well as thoroughly reviewing much of the literature on generative network models [53]. Similarly, a variety of work has been done trying to understand different activities in online media, sometimes incorporating simulation. Romero et al., for instance, developed a classification for different types of hashtags and simulated hashtag propagation [54].

My simulation will use loose approximations of individual users to produce tweets. It fits into the longer tradition of using agent-based models as a way of representing individuals' decisions. This connects back to a variety of agent-based models such as the Schelling segregation model and Axelrod's model of the dissemination of culture. These both use simplified representations of individuals, modeled as having only one facet to their personalities and as being fixed in a particular location within a grid or on a number line [55], [56]. In contrast, in previous research I have worked with other researchers to use the CONSTRUCT modeling system to situate a number of tribal actors within different relationships to approximate changes of belief [57], [58]. Each of these models provides a limited set of characteristic features for individuals that can be used to describe their different beliefs. While useful in certain circumstances, in this particular simulation it is unnecessary to suppose any kind of relational structure between the represented individuals.

4 Data

For the analysis portion of this thesis, I will be reviewing collections of tweets sent during two 2012 disasters: those messages sent during Hurricane Sandy, and those sent during the 2012 Colorado Wildfires. Both sets of tweets were collected using the TweetTracker software developed at Arizona State University [9]. TweetTracker provides a GUI-based interface to the Twitter API, allowing users to specify keywords, particular users, or geographic bounding boxes to constrain their searches. Because our tweets were collected using keyword-based searches, they should only be considered representative of those tweets likely to be selected by relief workers or other actors attempting to use Twitter to find disaster-related tweets during a particular disaster event. Similarly, the data were collected using the Twitter API, which only includes 1% of all tweets, and so are not the totality of tweets using these particular keywords. The data do qualify as a representative sample of the sorts of tweets that a user searching for tweets using disaster-related keywords might draw.

In addition to describing the data, I also provide some contextualizing information about both Hurricane Sandy and the 2012 Colorado Wildfires. The context of each disaster is important to understanding the tweets produced.

4.1 Hurricane Sandy

According to the National Hurricane Center, Sandy originated from a tropical wave that departed Africa on October 11, 2012, moving west [59]. At approximately 0700 EST on October 24, 80 nautical miles south of Kingston, Jamaica, it finished evolving into a hurricane. It moved north, making landfall in Jamaica at Bull Bay at 1400 EST, continued up through Cuba and the Bahamas, and then moved northeast towards the East Coast. While losing speed, by 0700 EST on October 27 it had once again gained hurricane force but had a particularly unusual configuration: "Reconnaissance data indicated that the radius of maximum winds was very large, over 100 [nautical miles], and the strongest winds were located in the western semicircle of the cyclone. In addition, satellite, surface and dropsonde data showed that a warm front was forming a few hundred miles from the center in the northeast quadrant, with another weak stationary boundary to the northwest of the center." On October 26 and 27, seven states and the District of Columbia declared states of emergency [60]. Moving up the east coast, Sandy reached peak intensity at about 0700 EST on October 29, while its center was 45 nautical miles southeast of Atlantic City. It made landfall near Brigantine, New Jersey at 1830 EST, then pushed west-northwest and eventually broke up over Ohio after 0700 EST on October 31.

There was significant warning that Sandy would strike the US; stories covering the likelihood of a "perfect storm" hitting the US began appearing as early as October 25. Leading up to the disaster, Twitter launched a dedicated "hashtag page" that presented both tweets about Sandy and important Twitter accounts to follow in order to get information related to the disaster [40]. The National Hurricane Center and other organizations advertised that people could receive updates from them by checking Twitter [61]. In the storm's immediate aftermath, various news outlets reported on how Twitter had played an important role in keeping people apprised of the hurricane's progress [1], [62]. The media characterized the flow of tweets as "riveting", a way of seeing what was happening during the storm on a moment-to-moment basis.

Sandy has been estimated to have directly caused 72 deaths in the U.S. (75 outside of it) and indirectly caused 87 more. According to the National Hurricane Center, Sandy inflicted approximately 50 billion dollars of property damage and left approximately 8.5 million people without power, some for months after the disaster. (On November 1, the number was 4.8 million.) New York Governor Andrew Cuomo estimated damage to his state at $32.8 billion, while New Jersey Governor Chris Christie estimated damage worth $36.8 billion [63]. As late as August 15 of this year, news stories have continued to appear about the slowness of the recovery process [64].

4.2 Tweets collected during Hurricane Sandy

I will be working with two sets of tweets collected during Hurricane Sandy, between which there will be some degree of overlap. Both sets of tweets were collected using the TweetTracker software developed at Arizona State University.


The first set of Sandy data consists of 5,634,311 tweets by 3,036,353 users collected between October 25, 2012, 3:42 PM and November 3, 2012, 11:59 PM. Tweets were acquired if they used any of 34 keywords or were made by any of 72 users. Tweets were also briefly acquired if they came from a region in the bounding box with a southwest corner at (29.571, -76.289) and a northeast corner at (40.555, -75.212) [65]. The bounding box was discarded soon after collection began because it was acquiring too much noise. The different selectors were specified by Dr. Rebecca Goolsby based on the recommendations of different subject matter experts with whom she works as a program coordinator; the geographic region was chosen based on predictions about where Sandy would be likely to strike. The chosen keywords were intended to pull out mentions of the storm, conditions in the regions it was afflicting, and different issues that could be caused by the hurricane. The followed users include official accounts for local newspapers, different government agencies that play roles in disasters, and NGOs that provide disaster relief. A complete list of the selectors used and a diagram of the bounded region are present in Appendix A.

The second set of tweets consists of 4,288,878 tweets by 2,514,529 users, collected between October 30, 2012, 1:13 AM and November 1, 2012, 8:10 PM [66]. Tweets were collected if they used any of 10 keywords, chosen by CMU graduate students to capture both reports of events related to Sandy and events occurring in regions afflicted by Sandy. The decisions about particular keywords were made using surface-level criteria (e.g. the name of the place or the hurricane). All of the keywords are listed in Appendix B.

Combined, and omitting six stray tweets captured before 8:00 PM on the first day, there are 8,529,999 unique tweets in the data, distributed over the ten-day period as shown in Figure 1; the most tweets recorded in any one hour was 119,603. Peaks occurred on October 30 and 31, separated by a brief, sharp trough in the early morning of the 31st; an average of 34.34% of the tweets contain hashtags, and only 3.8% have latitudes and longitudes. Further, the distribution of these geocoded tweets is skewed towards the beginning of the hurricane, as shown in Figure 2. This is likely an artifact of the bounding box mentioned earlier, but the scale of the impact on the sample is surprising.


Figure 1 A plot of all the tweets collected during Hurricane Sandy, aggregated by hour



Figure 2 Proportion of collected tweets with latitudes and longitudes.
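For reference, the hourly aggregates plotted in Figures 1 and 2 can be computed with a few lines of code. The sketch below is illustrative only: the file name and the column names (created_at, text, geo) are assumptions about how a TweetTracker export might be structured, not the actual format of my data.

# Minimal sketch (not the thesis pipeline) of the hourly aggregation behind
# Figures 1 and 2. File and column names are assumed for illustration.
import pandas as pd

tweets = pd.read_csv("sandy_combined.csv", parse_dates=["created_at"])
tweets = tweets.set_index("created_at").sort_index()

hourly_total = tweets["text"].resample("1H").size()                    # tweets per hour
hashtag_rate = (tweets["text"].str.contains("#", na=False)
                .astype(float).resample("1H").mean())                  # share with hashtags
geo_rate = tweets["geo"].notna().astype(float).resample("1H").mean()   # share with coordinates

print("Busiest hour:", hourly_total.idxmax(), int(hourly_total.max()))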

4.3 The 2012 Colorado Wildfires

Colorado Governor John Hickenlooper called the summer of 2012 "the worst fire season in the history of Colorado." Numerous fires broke out during the summer, forcing individuals to evacuate from their homes; the Waldo Canyon Fire forced approximately 32,000 people to flee Colorado Springs [67]. Between late March and the end of July, at least 16 different fires were logged in the Inciweb interagency disaster reporting system [68]–[83]. The Denver Post noted 12 different fires [84]. According to the Post, only four of those fires (High Park, Waldo Canyon, Last Chance, and Lower North Fork) were responsible for destroying homes, razing 637 in total. All 12 together burned more than 213,000 acres. Figure 3 shows the overlapping time periods of the different fires, each of which required teams of firefighters to contain.


Figure 3 Durations of 16 large forest fires in Colorado during the summer of 2012. The majority of the collected tweets discuss the Waldo Canyon, Lower North Fork, and High Park Fires. (Rows 1, 6, and 12.)

4.4 Tweets collected during the 2012 Colorado Wildfires

Tweets from the Colorado Wildfires were collected by Shamanth Kumar at Arizona State University using a set of nine keywords intended to capture general information about the ongoing disaster. A total of 11,209 tweets by 6,246 users were captured between March 29, 10:45 PM and July 31, 2:41 PM; 9,000 tweets (over 80%) were captured between June 28 and July 2, the weekend of Colorado Springs' evacuation for the Waldo Canyon Fire. (See Figure 4.) The keywords used to collect the wildfire data are listed in Appendix C. They refer to the wildfires generally ("#colorado #fires") but make specific reference to the Waldo Canyon, High Park, and Lower North Fork fires. During the peak capture period, 6,435 captured tweets contained hashtags; the top five hashtags collected over the entire time period are shown in Table 1. While the most commonly used hashtag relates to the Waldo Canyon Fire, the second most common represents the Lower North Fork Fire. While the latter is dwarfed by the former, the data does contain large numbers of tweets from two forest fires.

Hashtag (lowercased)    Tweets using hashtag
waldocanyonfire         5,300
lowernorthforkfire      782
highparkfire            405
cofire                  316
cofires                 212

Table 1 The five most frequently used hashtags in the wildfire data.
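Counts like those in Table 1 can be generated by straightforward hashtag tallying. The snippet below is a minimal illustration of that kind of count, not the exact procedure used to build the table; the regular expression and the sample texts are placeholders.

# Illustrative sketch: counting lowercased hashtags across a list of tweet texts.
import re
from collections import Counter

def top_hashtags(texts, n=5):
    counts = Counter()
    for text in texts:
        for tag in re.findall(r"#(\w+)", text):
            counts[tag.lower()] += 1
    return counts.most_common(n)

sample = ["Evacuations near #WaldoCanyonFire", "Smoke from the #waldocanyonfire again"]
print(top_hashtags(sample))  # [('waldocanyonfire', 2)]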



Figure 4 A graph of the wildfire tweets collected, aggregated by day. 80.3% of tweets were collected between June 28 and July 2

4.5 Additional Tweets

I may supplement these collections of tweets with two additional sources. The first is any tweets that I collect from disasters that occur while I am carrying out my thesis research. Additional data about how people respond to hurricanes or wildfires would be useful in increasing the size of the disaster pool from which the simulation will be built.

The second source of additional tweets would be an unofficial, archived collection of tweets scraped by CMU researchers from the Twitter stream. These tweets were collected using the Twitter API without specifying any particular set of keywords. That is, they represent a general mix of the content being discussed at the time of a particular disaster, albeit not specifically about the disaster. Captured tweets that incidentally use the keywords specified in the different data sets, or other keywords related to Sandy or the wildfires, could safely be used to increase the quantity of data.

5 Methodology

In this section, I describe the methodology that I will use to go from a collection of raw, unclassified tweets to a simulation of the tweets produced during two different types of disasters. In general, the process spans coding a set of tweets with Vieweg-style codings to use as a gold standard for a classifier (and experimenting with different methods of creating a gold-standard coding), implementing a classifier for accurately labeling tweets with gold-standard codings, analyzing the tweets produced during disasters based on these codings, and developing a simulation based on the results of the analysis.

5.1 Creating a Gold-Standard for Tweets

In Vieweg's thesis, she worked with two linguistics students to code several thousand tweets from different disasters using a new coding scheme. In order to simplify the task, she and the raters applied the codings in three passes. In the first pass, the raters identified tweets containing information related to situational awareness (SA). In the second, they labeled SA tweets as relating to the social, built, or physical environment (S, B, and P). In the third pass, they applied a subset of at most five labels to each tweet, where particular labels exist underneath each category.


My coding process is derived from that used by Vieweg, but is also intended to test whether it is possible to use crowd workers instead of trained linguistics students to apply the same labels. In her research, she used approximately 1,000 SA tweets from each disaster. I will use 1,000 as a baseline size for my own collection of gold tweets from each disaster. This will likely be enough tweets to train a classifier that can identify the different, top-level categories of SA tweets, but may be inadequate for developing a classifier that can pick out the different sub-categories. If necessary, I will put additional tweets through the coding process in order to obtain more gold. The three passes that will be made over the collected data are described below. They should be understood as being split into two phases. In the first, both linguistics students and crowd workers will code 250 tweets. If the two groups perform similarly, in the second coding phase I will only use crowd workers for further coding; if the results differ, I will only use students. The "linguistics students" should be understood to be a pair of linguistics students that I will hire and train to code for each pass. I will break any "ties" that arise during the coding process. The "crowd workers" should be understood to be an undifferentiated mass of Americans who speak English. They receive training immediately before labeling a particular set of tweets during any coding pass. I currently anticipate that every job on the crowdsourcing platform will contain ten tweets.
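One way to operationalize "perform similarly" would be to compute an inter-rater agreement statistic between the two groups on the shared 250-tweet sample. The sketch below uses Cohen's kappa as one plausible choice; the statistic and the placeholder labels are illustrative assumptions rather than a fixed part of the protocol.

# Hypothetical sketch: comparing crowd-worker and linguistics-student codings on
# the shared sample using Cohen's kappa (one possible agreement measure).
from sklearn.metrics import cohen_kappa_score

# 1 = tweet judged to contain situational awareness, 0 = not (placeholder data).
student_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
crowd_labels   = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(student_labels, crowd_labels)
print(f"Cohen's kappa between groups: {kappa:.2f}")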

5.1.1 The Situational Awareness Pass

The first pass on the data will be devoted to identifying a set of tweets that exhibit situational awareness. This is a soft category, and one that may be difficult for crowd workers to apply effectively. I will instruct the two different worker groups about how to identify whether or not a particular tweet in a disaster contains situational awareness data, and ask them to code tweets accordingly. As stated above, I would like to obtain approximately 1,000 gold-standard tweets for each disaster. When Vieweg coded tweets from the 2010 Haitian Earthquake, she used a sample of 5,000 to successfully extract 1,000 SA tweets. Unlike Vieweg's sample, which represented the tweets sent by users who used disaster-related keywords at least three times, my data represents the sort of information found by relief groups when searching for keywords: information that is on the "surface" of a disaster. I therefore hypothesize that the ratio of SA to non-SA tweets will be fairly high, and I see the 5,000 number as an upper bound for estimating costs.

5.1.2 The Categorization Pass

In the second pass, tweets determined to relate to situational awareness are coded based on whether they contain information related to the social, physical, or built environment. These labels aren't exclusive, and a given tweet may possess any combination of them. Unlike the identification of situational awareness information, training crowd workers for this task should be relatively straightforward; Vieweg's thesis provides paragraph-length descriptions of the different categories that can be used as the basis of training. There is a reasonable threat that cryptic references to particular kinds of damage will be hard for crowd workers to interpret, so the training material may need to account for this. On the other hand, given the surface nature of our data, the appropriate categories may be readily apparent.

5.1.3 The Labeling Pass

In the third pass, the SA tweets are each assigned up to five labels based on the different categories to which they have been previously assigned. Each particular category is affiliated with a set of different potential labels: 4 for Built Environment, 7 for Physical Environment, and 24 for Social Environment. The linguistics students will be expected to be able to work with any number of these labels at any one time. The same cannot be said for the crowd workers, who may not have the same capabilities or patience, especially given the relatively low payment for performing the task. Tweets labeled exclusively with the Built and Physical Environment codes will pose no problem, as the number of codes is small enough to be handled by a single worker. If a tweet has been labeled as relating to the Social Environment or any combination of the different categories, the number of labels will be more than can be trusted to a single crowd worker.

There are several ways to design the labeling task to solve this problem. The most straightforward would be to present ordinal subsets of labels to crowd workers, have them select which ones apply, and then continue with run-off votes until a final set is chosen. For example, if a tweet was assigned the B and P labels, the combined 11 labels would be broken up as (1-5) and (6-11); a worker assigned to a subset can choose any number of the labels in the subset to assign to the tweet. If the workers choose more than five labels for a tweet, an additional crowd worker would be presented with all of the chosen answers and told to pick five. This method doesn't control for the possibility that the labels in a particular group may bias the worker to choose those labels, or the fact that, depending on the proclivities of the workers, the tie-breaking round may include just as many labels as were present from the start.

My current preference is to address this problem by breaking the appropriate group of labels into a number of random subsets, such that every label is guaranteed to appear in at least K subsets. One crowd worker will be assigned to each subset. I have written a preliminary program that generates such subsets. For the 11-label case with K = 2 and letting crowd workers choose between seven categories, it produces: (1, 2, 3, 7, 8, 9, 11), (3, 5, 6, 7, 8, 9, 10), (3, 4, 5, 6, 9, 10, 11), and (1, 2, 3, 4, 5, 9, 11). A crowd worker would be assigned to each of these subsets, and the majority of votes for each label would decide whether it should be applied. In the event of more than five labels receiving an equal number of votes, the set of tied labels can be split using the algorithm and voted on again; if the number of tied entries is 7 or fewer, a single crowd worker could be used to break the tie by selecting an aggregate top 5.

Combination of Categories    Number of Labels    Subsets when K = 3    Subsets when K = 5
B and P                      11                  5                     9
S                            24                  13                    21
S and B                      28                  16                    25
S and P                      31                  18                    29
S, B, and P                  37                  23                    35

Table 2 Combinations of category labels, the number of labels in the combination, and the number of random subsets needed to have K = 3 and 5 when putting seven labels in each combination.
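The sketch below illustrates one way a subset generator of this kind could work; it is a simplified stand-in for the preliminary program mentioned above, and its sampling strategy is an assumption rather than that program's actual logic.

# A minimal sketch (not the preliminary program itself) of generating random
# label subsets of a fixed size so that every label appears in at least K of
# the subsets presented to crowd workers.
import random

def make_subsets(labels, k=2, subset_size=7, seed=None):
    rng = random.Random(seed)
    coverage = {label: 0 for label in labels}
    subsets = []
    while any(count < k for count in coverage.values()):
        # Prefer labels that still need coverage, then pad with random others.
        needed = [l for l in labels if coverage[l] < k]
        rng.shuffle(needed)
        subset = needed[:subset_size]
        if len(subset) < subset_size:
            extras = [l for l in labels if l not in subset]
            subset += rng.sample(extras, subset_size - len(subset))
        for label in subset:
            coverage[label] += 1
        subsets.append(sorted(subset))
    return subsets

print(make_subsets(list(range(1, 12)), k=2))  # the 11-label (B and P) case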

5.2 Developing a Classifier

The particular classifier that I am planning to develop will use supervised learning to determine appropriate Vieweg-style labels for uncoded tweets. Just as the coding process has three tiers, so too can classification potentially be considered a three-tiered process. The first tier, labeling tweets as containing situational awareness information, has been addressed in prior work by Verma et al. The researchers were able to correctly classify tweets from several disasters between 83.5% and 88.8% of the time using a MaxEnt classifier, unigrams, part-of-speech tags, subjectivity coding, register, and tone; bigrams provided no useful improvement [8]. The last four features were themselves applied using classifiers trained during the course of the experiment. The second- and third-tier forms of classification (category labels and individual labels) lack any such clear prior. Vieweg demonstrated the utility of using VerbNet as a basis for applying category labels, but individual labels remained outside the scope of her work. My current plans for training include using at least unigram text features, subjectivity labels, POS tags, and verbs and VerbNet classes, though others may be added as needed. While I plan to experiment with several different supervised learning algorithms, I will make a point of using both MaxEnt, because of its prior success, and support vector machines, because of their established usage on Twitter data and track record [5], [46], [85].

As mentioned earlier, the 1,000-tweet baseline may be insufficient for training a useful classifier. In Vieweg's work, the distribution of tweet labels in each of her data sets was notably skewed. Fifteen labels are referenced in fewer than 50 tweets; thirteen are referenced in less than 1% of the collected tweets. Training a model with only fifty positive examples is likely untenable. While increasing the quantity of coded data may help to address this, it is also quite possible that in the end I will only be able to use the system to automatically apply a subset of the established labels. That said, if I can effectively use classifiers to find tweets that definitively do not possess any of the more frequent labels but do possess situational awareness, I can take these tweets back to the student coders or crowd workers to create additional training data for these underrepresented categories. To successfully complete the classification phase of this thesis, I will need to be able to apply all of Vieweg's labels that are used in a non-trivial number of tweets with a reasonable degree of success, which I would define as at least 80% accuracy. If I cannot achieve this end, then in order to carry out the analysis described in the next section I will return to using crowd workers or linguistics students to code an additional, large set of tweets and proceed from there.
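As a rough illustration of the first classification tier, the sketch below prototypes a situational awareness classifier over unigram features using scikit-learn; logistic regression stands in for MaxEnt, LinearSVC for a support vector machine, and the tiny training set is a placeholder rather than real coded data.

# Sketch of a first-tier situational awareness classifier using unigram features.
# Logistic regression is one maximum-entropy formulation; LinearSVC can be
# swapped in to test a support vector machine. Training data is a placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["power out on 5th ave, trees down", "good luck to everyone in nyc tonight",
         "shelter open at the high school gym", "can't believe this storm lol"]
labels = [1, 0, 1, 0]  # 1 = situational awareness, 0 = not (placeholder coding)

maxent = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression(max_iter=1000))
maxent.fit(texts, labels)
print(maxent.predict(["flooding reported near the boardwalk"]))

svm = make_pipeline(CountVectorizer(), LinearSVC())
svm.fit(texts, labels)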

5.3 Data Analysis

The purpose of this analysis will be to describe how different disasters are represented on social media over the course of their duration. I will use the classifier developed in the previous section to code all of the tweets that we have collected with the appropriate Vieweg-style labels. I will then compare how the frequencies of the different categories co-occur relative to the different critical events in the disaster. My analysis will also look at a variety of other features:

Hashtags. Hashtags are often understood as shorthand for a topic or theme in a tweet. Certain hashtags are often seen as indicative of a particular disaster (e.g. "#sandy" during Hurricane Sandy), but it can take time for those associated with a particular event to coalesce. By looking at the frequencies of hashtags in different tweets, I hope to determine how certain hashtags can be used to approximate situational awareness, and how deviations from expected frequencies can indicate new developments in the disaster.

Mentions of other users. The disaster research literature discusses how individuals try to stay in contact with their loved ones during disaster events. While the data doesn't allow us to look at the granular activity of individual Internet users, it does allow us to look at whether the collected surface tweets reflect this particular underlying social reality.

Retweets. One of the most common activities in the Twitter space is retweeting; users share messages posted by others, sometimes adding in their own thoughts and comments. While other researchers have looked at the likelihood that individuals will retweet rumors that have been established as true or false during disaster, it's unclear how much retweeting habits are altered by a disaster event. Do single tweets become heavily retweeted and dominate the disaster's story? Do individuals share personal information more often?

Location. I expect that tweets produced during a disaster will reflect the particular characteristics of the different afflicted regions from which the tweets derive. Only a fractional number of tweets are geocoded [24]. As such, a reasonable analysis of the impact of location may be impossible. Assuming that such an analysis can be conducted, however, I will attempt to correlate the particular location of each tweet with the disaster's epicenter, as well as any particular conditions afflicting the region from which the tweet is coming. (For instance, if power has been lost in a particular area in New York.) The findings of my analysis will be contingent on my ability to determine the presence of such conditions in the areas surrounding the disaster regions.


Time of day. Twitter usage patterns are known to vary based on the time of day. A disaster can have fairly dramatic impacts on the schedules kept by different individuals. Small events, like power outages or evacuation notices, can go undetected by the population if they occur at night. My analysis should attempt to differentiate between the impact of the particular time of day and that of the disaster's consequences on individuals' tweeting.

Events. Based on background reading about the disasters, as well as the codings and the other tweet features examined in this section, can we directly tie particular blocks of tweets to the occurrence of particular local events related to the disaster, such as local flooding, power outages, or official announcements?

Type of user. An individual's perception of a Twitter user's role can have an impact on their perception of the quality of that user's content. Both to account for this in simulation and to understand the impact of different users' roles on their behavior when using Twitter, I will code a number of individuals with role information and look at how it compares with their activity. In order to determine how posting frequency and role are correlated, I will first bucket all the accounts in each data set into quintiles based on their activity. I will then randomly sample a percentage of each bucket and code those users with information about their roles, as sketched below. At present, my plan is to use three role labels: individual, member of the media or media organ, and member of a relief organization or a relief organization itself.
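The bucketing and sampling step could look roughly like the following sketch; the 10% sampling rate and the input format (a map from user to tweet count) are illustrative assumptions rather than fixed choices.

# Illustrative sketch of bucketing accounts into activity quintiles and sampling
# a fixed fraction of each bucket for manual role coding.
import random

def sample_users_by_quintile(tweet_counts, fraction=0.10, seed=7):
    rng = random.Random(seed)
    users = sorted(tweet_counts, key=tweet_counts.get)   # least to most active
    quintile_size = max(1, len(users) // 5)
    sampled = {}
    for q in range(5):
        start = q * quintile_size
        bucket = users[start:start + quintile_size] if q < 4 else users[start:]
        k = max(1, int(len(bucket) * fraction))
        sampled[q + 1] = rng.sample(bucket, k)
    return sampled

counts = {f"user{i}": i for i in range(1, 101)}   # placeholder activity counts
for quintile, users in sample_users_by_quintile(counts).items():
    print(quintile, users)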

5.4 Creating a Simulation

The final component of the thesis will be two simulations, each based on the results of the analysis of one disaster and intended to facilitate disaster training and preparation. One simulation will produce the characteristic pseudo-tweets seen during a Colorado wildfire, the other the characteristic pseudo-tweets produced during a hurricane striking New York City. Each simulation will take as input a set of simplified parameters specifying the disaster’s severity (e.g. peak hurricane wind speed) and a total population of Twitter accounts. Currently, each pseudo-tweet is planned to include the fields listed below; a sketch of the corresponding data structure follows the list.

• Timestamp
• Type of user
• Vieweg-style categorization
• Generalized disaster-related hashtags (e.g. #disasterTag1, #disasterTag2)
• List of mentioned users
• User classification
• Location (if the tweet is determined to possess a geocode)
• A retweet indicator and, if set, a reference to the originating tweet
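
As a concrete illustration, the following is a minimal sketch of how a pseudo-tweet record might be represented, assuming Python dataclasses; the field names are illustrative and would be finalized during implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class PseudoTweet:
    """One simulated tweet: no text, only labels and representative features."""
    timestamp: datetime
    user_type: str                    # e.g. "individual", "media", "relief"
    vieweg_category: str              # Vieweg-style content label
    hashtags: List[str] = field(default_factory=list)   # e.g. ["#disasterTag1"]
    mentioned_users: List[str] = field(default_factory=list)
    user_classification: Optional[str] = None
    location: Optional[tuple] = None  # (lat, lon) if the tweet is geocoded
    retweet_of: Optional[int] = None  # id of the originating pseudo-tweet, if a retweet
```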

Currently, I plan to construct the simulation algorithm around the disaster’s intensity, the types of users, and the likelihoods of different discrete events occurring within the scope of the disaster. Such events would include power outages, government evacuation orders, and damage to property. During the analysis, I will gather information about these events by reading primary source materials related to each disaster.

When run, the simulation will begin by initializing a number of virtual Twitter users identified as individuals, media accounts, or relief NGOs; other types of users may be added if they emerge from the analysis of user roles. At every time period, there will be some probability that a particular disaster event will occur. Correspondingly, there will also be a probability that each user may emit a pseudo-tweet. This probability will be conditioned on the time since the start of the disaster, the time of day, the types of events seen during the disaster and the time since the most recent one, and the type of user.
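
A minimal sketch of this time-stepped loop appears below, assuming hourly steps and per-step probabilities for events and user tweeting; the probability function (tweet_probability) and its weights are placeholders, not the calibrated values that will come out of the analysis.

```python
import random

# Hypothetical base rates; real values would be estimated from the coded corpora.
EVENT_PROB_PER_STEP = 0.02
BASE_TWEET_PROB = {"individual": 0.01, "media": 0.05, "relief": 0.03}

def tweet_probability(user_type, hours_since_start, hour_of_day, hours_since_last_event):
    """Toy model: probability that a user emits a pseudo-tweet in one hourly step."""
    p = BASE_TWEET_PROB[user_type]
    p *= 0.3 if hour_of_day < 6 else 1.0             # overnight lull
    p *= 2.0 if hours_since_last_event < 2 else 1.0  # burst after a recent event
    p *= max(0.2, 1.0 - hours_since_start / 240.0)   # slow decay over ten days
    return min(p, 1.0)

def run(users, n_steps, seed=0):
    """users: list of user-type strings; returns (step, user index) pairs for emitted tweets."""
    rng = random.Random(seed)
    hours_since_last_event = 999
    emitted = []
    for step in range(n_steps):
        if rng.random() < EVENT_PROB_PER_STEP:       # a disaster-related event occurs
            hours_since_last_event = 0
        for i, user_type in enumerate(users):
            p = tweet_probability(user_type, step, step % 24, hours_since_last_event)
            if rng.random() < p:
                emitted.append((step, i))
        hours_since_last_event += 1
    return emitted

sample = run(["individual"] * 50 + ["media"] * 5 + ["relief"] * 5, n_steps=72)
```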

As noted, the simulation will take input parameters to determine the intensity of the disaster. Because of Hurricane Sandy’s exceptional strength, it will sit at the far end of the spectrum; pseudo-tweets produced by less intense hurricanes will have lower probabilities of occurring. The data from the Colorado wildfire season contains information from multiple fires, and so should be more readily scalable to different levels of fire threat. Additional data would still be welcome.

5.5 Validating Simulated Results

To validate the simulation, I will attempt to use it to replicate the source data precisely, and I will also reach out to subject matter experts in the relief field for their opinions on the soundness of the methodology and of the results being produced.

If the simulation as described is calibrated with all of the settings found during the analysis, then it should produce results that accord with those seen in the actual disaster. That is, if instead of probabilistically distributing the events seen in the disaster the model is set to make them occur at the precise moments at which they are logged during the analysis, and if the mix of different types of users is set to match that recorded in the data, then the results produced by the model on repeated runs should match those seen during both recorded disasters. This form of validation is relatively weak, as it requires calibrating against the one specific event from which the entire model was derived.
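
A minimal sketch of this replay-style check appears below, assuming a caller-supplied simulate_step(step, event_active) function that returns the category labels emitted in one step, and a Counter of observed label frequencies from the coded corpus; both the function signature and the error metric are illustrative.

```python
from collections import Counter

def replay_error(simulate_step, scheduled_events, observed_counts, n_steps, n_runs=20):
    """Force logged events at their recorded steps (instead of sampling them),
    then compare simulated per-label counts with the observed counts.

    scheduled_events: set of steps at which logged disaster events occurred.
    observed_counts: Counter of category labels in the real, coded corpus.
    Returns the mean absolute per-label count difference, averaged over runs.
    """
    diffs = []
    for _ in range(n_runs):
        simulated = Counter()
        for step in range(n_steps):
            for label in simulate_step(step, step in scheduled_events):
                simulated[label] += 1
        labels = set(observed_counts) | set(simulated)
        diffs.append(sum(abs(observed_counts[l] - simulated[l]) for l in labels)
                     / max(1, len(labels)))
    return sum(diffs) / len(diffs)
```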

To supplement this, I will reach out to the disaster relief community and attempt to contact workers or volunteers affiliated with its technical community to assess their opinions of the methodology underlying the model and of the generated output. If they believe that the principles and results make sense in the context of their work, I will consider that reasonable evidence that the simulation is valid. Correspondingly, it is conceivable that a critique from the community will lead me to rethink the simulation’s features; if doing so increases the simulation’s utility, that effort will be well spent toward the practical end that is this thesis’s stated goal.

6 Limitations

This thesis research has several important limitations. Notably, the data I will be using capture only the “surface” of each disaster, describing tweets with particular superficial features related to that disaster; the tweets come from a sample of just two disasters; and developing a classifier for more than a small number of tweet labels may be difficult.

In the first case, the selectors used to collect the data from Hurricane Sandy and the wildfire season were chosen to acquire tweets from relief services and tweets that most obviously used particular hashtags. While certain users appear frequently in the data, many others show up only once or twice. As such, the data and the corresponding simulation cannot be considered a balanced representation of everything produced by users during disaster. Rather, the data represent only the tweets that users explicitly wanted associated with each disaster, and not any follow-up content. Correspondingly, the analysis and simulation will not represent the totality of tweets produced in disaster, but rather a selective portrait framed by the collected tweets.


The second limitation, that the data comes from only a few sources, limits my ability to generalize results. Hurricane Sandy was an immense disaster that caused widespread devastation; it was not a typical hurricane. Having only one data point decreases my ability to validate the simulation as a good representation of a hurricane. Similarly, the wildfire data, while spanning a whole season, is largely concentrated around only a few fire events. While this provides a somewhat stronger basis for validation, more data would be useful.

The third limitation, while also a product of the small amount of data, stems from the nature of tweets produced during disaster. Although the sampling method for my tweets differs significantly from that used by Vieweg, it is possible that my data, like hers, will contain only small numbers of tweets for some labels. This can be partly addressed by applying effective classifiers to find candidate tweets among the unlabeled data and then manually coding them to increase the size of the data pool; a minimal sketch of this classifier-assisted labeling loop appears below. Nonetheless, successful classification of some labels may prove impossible.
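
The sketch below illustrates one way this could work, assuming a scikit-learn-style classifier exposing fit and predict_proba, feature matrices for the labeled and unlabeled tweets, and a code_by_hand callable standing in for the human coding step; it is an outline of the idea, not the thesis’s final tooling.

```python
def expand_rare_label(clf, labeled_x, labeled_y, unlabeled_x,
                      target_label, code_by_hand, batch_size=50):
    """Surface likely examples of a rare label from the unlabeled pool for manual coding.

    clf: any classifier exposing fit / predict_proba on feature vectors.
    labeled_x, unlabeled_x: feature matrices; labeled_y: existing labels.
    code_by_hand: callable standing in for the human coding step.
    Returns newly coded (example, label) pairs to add to the training data.
    """
    clf.fit(labeled_x, labeled_y)
    label_index = list(clf.classes_).index(target_label)
    scores = clf.predict_proba(unlabeled_x)[:, label_index]
    # Rank the unlabeled tweets by how likely the classifier thinks the rare label is.
    ranked = sorted(range(len(unlabeled_x)), key=lambda i: scores[i], reverse=True)
    return [(unlabeled_x[i], code_by_hand(unlabeled_x[i])) for i in ranked[:batch_size]]
```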

7 Final Results, Contributions and Extensions

After completing each step of the methodology described above, I will be able to characterize the thesis as having made the contributions listed in section 4. To wit: I will have established the feasibility of using a crowdsourcing platform to apply Vieweg-style codings; I will have developed a new classifier that can apply Vieweg-style codings to tweets collected using disaster-related keywords; I will have analyzed and described how two different disasters were represented on Twitter; and I will have created a simulation of the patterns of tweets produced during a hurricane and during a wildfire season.

The principal purpose of this thesis is to help Twitter data be used more effectively in relief situations. In the Contributions section, I posed three research questions. The first of these was “How can we inexpensively and rapidly classify tweets so that relief workers can use them?” The experiment with using both crowdsourcing and machine learning to apply Vieweg’s labels to unclassified data speaks to this point. In her own work, Vieweg argued that particular label codes can be inferred to be of potential interest to particular relief services. By improving the speed and accuracy with which these labels can be applied, I will help move them into practical use during disasters.

The second research question was “How are disasters represented in social media?” Other researchers have attacked this problem, often by collecting tweets produced during disasters and then attempting to describe the observed content qualitatively. My analysis will contribute to this body of work; by applying Vieweg’s labels to all of the tweets collected for these disasters and examining their change over time, I will attempt to offer a new perspective on the question.

My final research question was “How can we usefully reproduce the content produced on social media during disasters?” This question is addressed by my completion of the proposed simulation and by my validating both its ability to reproduce input data and its utility as judged by relief workers. The completed simulation should help relief organizations prepare for disasters by exposing them to the types of tweets that they might expect to see, at the rates at which they might expect to see them.


There are several ways in which this thesis’s contributions might be extended. As described in the Limitations section, even after the work is completed there will be a need to refine the simulation by incorporating additional disaster information. Given the likely imbalance in the numbers of tweets bearing different labels, additional work may also be useful in collecting and coding more examples of the underrepresented categories.

Assuming that crowdsourced application of the different codings to tweets proves effective, both the crowdsourcing workflow and the classifier could be extended into practical tools for processing tweets as a disaster unfolds. The analysis will also contribute to future research that seeks to understand real-world conditions from contributions to social media alone, without recourse to other sources of information.

8 Timeline for Completion

I have divided the work into five main steps: one for each contribution and one for the remaining writing. The steps are split into thirty sub-steps, some of which can be completed in overlapping periods of time. As planned, I estimate that the work will take approximately one year to complete. The thesis document itself will be split into six sections: one for each contribution, an introduction, and a conclusion. An outline and Gantt chart of the process are provided below. Note that the first, second, and third components have a reasonable degree of overlap; analysis of certain tweet features can begin before all the data has been crowdsourced, and attempts to classify the data at a particular tier can begin once that tier of gold-standard data has been created.

1. Refining Tweets to Gold
   a. Creating training materials and software (5 weeks)
   b. Recruiting students and performing the first pass, phase one (4 weeks)
   c. Recruiting crowd workers and performing the first pass, phase one (2 weeks)
   d. First pass, phase two (2 weeks)
   e. Using students and recruiting crowd workers to perform the second pass, phase one (3 weeks)
   f. Second pass, phase two (2 weeks)
   g. Using students and recruiting crowd workers to perform the third pass, phase one (3 weeks)
   h. Third pass, phase two (2 weeks)
   i. Writing a chapter (3 weeks)
2. Developing a Classifier
   a. Applying additional features to tweets (4 weeks)
   b. Classify for situational awareness (4 weeks)
   c. Classify for category (4 weeks)
   d. Classify for particular labels (4 weeks)
   e. Writing a chapter (3 weeks)
3. Analysis
   a. Vieweg’s labels (2 weeks)
   b. Different types of users and their relative activity (3 weeks)
   c. Events (3 weeks)
   d. Hashtags, mentions, retweets, location, date and time (5 weeks)
   e. Writing a chapter (3 weeks)
4. Simulation
   a. Contact external SMEs (3 weeks)
   b. Develop the basic simulation model (5 weeks)
   c. Developing the hurricane simulation (6 weeks)
   d. Developing the wildfire simulation (6 weeks)
   e. Contact the SMEs and validate (4 weeks)
   f. Writing a chapter (3 weeks)
5. Additional writing
   a. Write introduction (2 weeks)
   b. Write conclusion (2 weeks)
   c. Prepare for defense (6 weeks)


9 References

[1] D. Carr, “How Hurricane Sandy Slapped the Sarcasm Out of Twitter,” New York Times: Media Decoder, 31-Oct-2012.
[2] J. Sutton, “Twittering Tennessee: Distributed Networks and Collaboration Following a Technological Disaster,” in Proceedings of the 7th International Conference on Information Systems for Crisis Response and Management, Seattle, Washington, USA, 2010.
[3] L. A. St. Denis, A. L. Hughes, and L. Palen, “Trial by Fire: The Deployment of Trusted Digital Volunteers in the 2011 Shadow Lake Fire,” in Proceedings of the 9th International Conference on Information Systems for Crisis Response and Management, Vancouver, British Columbia, Canada, 2012.
[4] Harvard Humanitarian Initiative, “Disaster Relief 2.0: The future of Information Sharing in Humanitarian Emergencies,” Harvard Humanitarian Initiative, UN Office for the Coordination of Humanitarian Affairs, United Nations Foundation, 2011.
[5] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes Twitter users: real-time event detection by social sensors,” in Proceedings of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA, 2010, pp. 851–860.
[6] S. Asur and B. A. Huberman, “Predicting the Future with Social Media,” presented at the Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, 2010, vol. 1, pp. 492–499.
[7] M. Naaman, J. Boase, and C.-H. Lai, “Is it really about me?: message content in social awareness streams,” in Proceedings of the 2010 ACM conference on Computer supported cooperative work (CSCW), Savannah, Georgia, USA, 2010, pp. 189–192.
[8] S. Verma, S. E. Vieweg, W. J. Corvey, L. Palen, J. H. Martin, M. Palmer, A. Schram, and K. M. Anderson, “Natural Language Processing to the Rescue? Extracting ‘Situational Awareness’ Tweets During Mass Emergency,” in Proceedings of the 2011 International AAAI Conference on Weblogs and Social Media, 2011.
[9] S. Kumar, G. Barbier, M. A. Abbasi, and H. Liu, “TweetTracker: An Analysis Tool for Humanitarian and Disaster Relief,” in Proceedings of the 2011 International AAAI Conference on Weblogs and Social Media, Barcelona, Spain, 2011, pp. 661–662.
[10] M. A. Cameron, R. Power, B. Robinson, and J. Yin, “Emergency situation awareness from twitter for crisis management,” in Proceedings of the 21st international conference companion on World Wide Web, New York, NY, USA, 2012, pp. 695–698.
[11] O. Okolloh, “Ushahidi, or ‘testimony’: Web 2.0 tools for crowdsourcing crisis information,” Particip. Learn. Action, vol. 59, no. 1, pp. 65–70, Jun. 2009.
[12] E. L. Quarantelli, “A Half Century of Social Science Disaster Research: Selected Major Findings and Their Applicability,” University of Delaware, Newark, Delaware, 2003.
[13] M.-A. Abbasi, S. Kumar, J. A. A. Filho, and H. Liu, “Lessons Learned in Using Social Media for Disaster Relief - ASU Crisis Response Game,” presented at the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, College Park, Maryland, 2012.
[14] S. E. Vieweg, “Situational Awareness in Mass Emergency: A Behavioral and Linguistic Analysis of Microblogged Communications,” University of Colorado at Boulder, Boulder, Colorado, USA, 2012.
[15] V. Hester, A. Shaw, and L. Biewald, “Scalable crisis relief: Crowdsourced SMS translation and categorization with Mission 4636,” in Proceedings of the First ACM Symposium on Computing for Development, London, United Kingdom, 2010, pp. 1–7.
[16] P. Earle, M. Guy, R. Buckmaster, C. Ostrum, S. Horvath, and A. Vaughan, “OMG Earthquake! Can Twitter Improve Earthquake Response?,” Seism. Res. Lett., vol. 81, no. 2, pp. 246–251, 2010.
[17] B. De Longueville, R. S. Smith, and G. Luraschi, “‘OMG, from here, I can see the flames!’: a use case of mining location based social networks to acquire spatio-temporal data on forest fires,” in Proceedings of the 2009 International Workshop on Location Based Social Networks, New York, NY, USA, 2009, pp. 73–80.
[18] S. Sato, M. Tatsubori, and F. Imamura, “Mass and social media corpus analysis after the 2011 great east Japan earthquake,” in Proceedings of the 21st international conference companion on World Wide Web, New York, NY, USA, 2012, pp. 711–712.
[19] A. Kongthon, C. Haruechaiyasak, J. Pailai, and S. Kongyoung, “The role of Twitter during a natural disaster: Case study of 2011 Thai Flood,” in Technology Management for Emerging Technologies (PICMET), 2012 Proceedings of PICMET ’12, Vancouver, BC, Canada, 2012, pp. 2227–2232.
[20] S. K. Ploeger, G. M. Atkinson, and C. Samson, “Applying the HAZUS-MH software tool to assess seismic risk in downtown Ottawa, Canada,” Nat. Hazards, vol. 53, no. 1, pp. 1–20, 2010.
[21] F. Morstatter, J. Pfeffer, H. Liu, and K. M. Carley, “Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose,” in Proceedings of the 2013 International AAAI Conference on Weblogs and Social Media, Boston, Massachusetts, USA, 2013.
[22] N. Arcenaux and A. S. Weiss, “Seems stupid until you try it: press coverage of Twitter, 2006-9,” New Media Soc., vol. 12, no. 8, pp. 1262–1279, 2010.
[23] Twitter, “#numbers,” Twitter Blog, 14-Mar-2011.
[24] Semiocast, “Twitter reaches half a billion accounts; More than 140 millions in the U.S.,” Semiocast Publications, 30-Jul-2012.
[25] Semiocast, “Brazil becomes 2nd country on Twitter, Japan 3rd; Netherlands most active country,” Semiocast Publications, 31-Jan-2012.
[26] V. Lipman, “The World’s Most Active Twitter Country? (Hint: Its Citizens Can’t Use Twitter),” Forbes, 01-May-2013.
[27] R. Krikorian, “New tweets per second record, and how!,” Twitter Blog, 16-Aug-2013.
[28] A. Smith and J. Brenner, “Twitter Use 2012,” Pew Research Center, Washington, DC, USA, Pew Internet & American Life Project, May 2012.
[29] A. Java, X. Song, T. Finin, and B. Tseng, “Why we twitter: understanding microblogging usage and communities,” in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, San Jose, California, 2007, pp. 56–65.
[30] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?,” in Proceedings of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA, 2010, pp. 591–600.
[31] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts, “Everyone’s an influencer: quantifying influence on twitter,” in Proceedings of The Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, 2011, pp. 65–74.
[32] R. R. Dynes, Organized Behavior in Disaster. Heath, 1970.
[33] D. E. Wenger, J. D. Dykes, T. D. Sebok, and J. L. Neff, “It’s a matter of myths: An empirical examination of individual insight into disaster response,” Mass Emergencies, vol. 1, no. 1, pp. 33–46, 1975.
[34] W. H. Form and S. Nosow, Community in disaster. Harper, 1958.
[35] D. Mendoça, T. Jefferson, and J. Harrald, “Collaborative adhocracies and mix-and-match technologies in emergency management,” Commun ACM, vol. 50, no. 3, pp. 44–49, Mar. 2007.
[36] I. Shklovski, L. Palen, and J. Sutton, “Finding community through information and communication technology in disaster response,” in Proceedings of the 2008 ACM conference on Computer supported cooperative work, San Diego, CA, USA, 2008, pp. 127–136.
[37] I. Shklovski, M. Burke, S. Kiesler, and R. Kraut, “Technology Adoption and Use in the Aftermath of Hurricane Katrina in New Orleans,” Am. Behav. Sci., vol. 53, no. 8, pp. 1228–1246, Feb. 2010.
[38] R. Munro, “Crowdsourcing and the crisis-affected community: Lessons learned and looking forward from Mission 4636,” Inf. Retr., vol. 16, no. 2, pp. 210–266, Apr. 2013.
[39] A. Sarcevic, L. Palen, J. White, K. Starbird, M. Bagdouri, and K. Anderson, “‘Beacons of hope’ in decentralized coordination: learning from on-the-ground medical twitterers during the 2010 Haiti earthquake,” in Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (CSCW), Seattle, Washington, USA, 2012, pp. 47–56.
[40] D. Sullivan, “Tracking Hurricane Sandy News Through Twitter,” Marketing Land, 30-Oct-2012.
[41] M. Mendoza, B. Poblete, and C. Castillo, “Twitter under crisis: can we trust what we RT?,” in Proceedings of the First Workshop on Social Media Analytics (SOMA), Washington D.C., District of Columbia, 2010, pp. 71–79.
[42] J. Pfeffer, T. Zorbach, and K. M. Carley, “Understanding online firestorms: Negative word of mouth dynamics in social media networks,” J. Mark. Commun., Forthcoming 2013.
[43] S. Sinnappan, C. Farrell, and E. Stewart, “Priceless Tweets! A Study on Twitter Messages Posted During Crisis: Black Saturday,” in Proceedings of the 2010 Australasian Conference on Information Systems (ACIS), 2010.
[44] Y. Qu, C. Huang, P. Zhang, and J. Zhang, “Microblogging after a major disaster in China: a case study of the 2010 Yushu earthquake,” in Proceedings of the ACM 2011 conference on Computer supported cooperative work (CSCW), Hangzhou, China, 2011, pp. 25–34.
[45] K. Starbird, L. Palen, A. L. Hughes, and S. E. Vieweg, “Chatter on the red: what hazards threat reveals about the social life of microblogged information,” in Proceedings of the 2010 ACM conference on Computer supported cooperative work, Savannah, Georgia, USA, 2010, pp. 241–250.
[46] K. Starbird, G. Muzny, and L. Palen, “Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions,” in Proceedings of the 9th International Conference on Information Systems for Crisis Response and Management (ISCRAM), Vancouver, British Columbia, Canada, 2012.
[47] M. A. Finney, “FARSITE: Fire Area Simulator–model development and evaluation,” U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station, Ogden, Utah, Research Paper RMRS-RP-4, Feb. 2004.
[48] M. A. Finney and K. C. Ryan, “Use of the FARSITE fire growth model for fire prediction in US National Parks,” in International emergency management and engineering conference, edited by JD Sullivan, JL Wybo, and L. Buisson. Paris, France: International Emergency Management and Engineering Society, 1995.
[49] E. K. Noonan, “A coupled model approach for assessing fire hazard at point Reyes national seashore: FlamMap and GIS,” in Second international wildland fire ecology and fire management congress and fifth symposium on fire and forest meteorology, Orlando, FL. American Meteorological Society, 2003, pp. 127–128.
[50] P. J. Vickery, P. F. Skerlj, J. Lin, L. A. Twisdale, M. A. Young, and F. M. Lavelle, “HAZUS-MH Hurricane Model Methodology. II: Damage and Loss Estimation,” Nat. Hazards Rev., vol. 7, no. SPECIAL ISSUE: Multihazards Loss Estimation and HAZUS, pp. 94–103, 2006.
[51] S.-C. Chen, M. Chen, N. Zhao, S. Hamid, K. Chatterjee, and M. Armella, “Florida public hurricane loss model: Research in multi-disciplinary system integration assisting government policy making,” Gov. Inf. Q., vol. 26, no. 2, pp. 285–294, 2009.
[52] J. F. Remo and N. Pinter, “Hazus-MH earthquake modeling in the central USA,” Nat. Hazards, vol. 63, no. 2, pp. 1055–1081, 2012.
[53] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters,” Internet Math., vol. 6, no. 1, pp. 29–123, 2009.
[54] D. M. Romero, B. Meeder, and J. Kleinberg, “Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter,” in Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 2011, pp. 695–704.
[55] R. Axelrod, “The Dissemination of Culture: A Model with Local Convergence and Global Polarization,” J. Confl. Resolut., vol. 41, no. 2, pp. 203–226, 1997.
[56] T. C. Schelling, “Models of Segregation,” Am. Econ. Rev., vol. 59, no. 2, pp. 488–493, May 1969.
[57] P. Landwehr, M. Spraragen, B. Ranga, K. M. Carley, and M. Zyda, “Games, Social Simulations, and Data—Integration for Policy Decisions: The SUDAN Game,” Simul. Gaming, Sep. 2012.
[58] C. Schreiber, S. Singh, and K. M. Carley, “Construct - A Multi-agent network model for the co-evolution of agents and socio-cultural environments,” Carnegie Mellon University, Pittsburgh, PA, CASOS Technical Report CMU-ISRI-04-109, May 2004.
[59] E. S. Blake, T. B. Kimberlain, R. J. Berg, J. P. Cangialosi, and J. L. Beven II, “Hurricane Sandy,” National Hurricane Center, Tropical Cyclone Report AL182012, Feb. 2013.
[60] CNN Library, “Hurricane Sandy Fast Facts,” CNN, 13-Jul-2013.
[61] J. Fenster and J. Swift, “Hurricane Sandy turned ‘Frankenstorm’ may be headed for Connecticut next week,” The Middletown Press, 25-Oct-2012.
[62] S. E. Cohen, “Sandy Marked a Shift for Social Media Use in Disasters,” Emergency Management, 07-Mar-2013.
[63] M. Sledge, “Hurricane Sandy Damage in New York By-The-Numbers,” The Huffington Post.
[64] S. Watson, “Hurricane Sandy victims: Recovery slowed by red tape and lack of information,” The Press of Atlantic City, Pleasantville, New Jersey, 15-Aug-2013.
[65] S. Kumar, “sampling methods for wildfire & sandy tweets?,” 29-Aug-2013.
[66] B. Chang, W. Frankenstein, and B. Yang, “#sandy.”
[67] K. Coffman, “Colorado wildfire expands viciously, Obama plans visit,” Reuters, Colorado Springs, Colorado, 27-Jun-2012.
[68] Southern Ute Agency, “Air Park,” InciWeb, 25-Jul-2012.
[69] Bureau of Land Management, “Pine Ridge,” InciWeb, 04-Jul-2012.
[70] Colorado State Forest Service, “Flagstaff Fire,” InciWeb, 30-Jun-2012.
[71] U.S. Forest Service, “Treasure,” InciWeb, 01-Oct-2012.
[72] U.S. Forest Service, “Waldo Canyon Fire,” InciWeb, 01-Oct-2012.
[73] Bureau of Land Management, “Lightner,” InciWeb, 09-Jul-2013.
[74] U.S. Forest Service, “Weber,” InciWeb, 09-Jul-2012.
[75] Bureau of Land Management, “Brush Creek,” InciWeb, 26-Jun-2013.
[76] U.S. Forest Service, “Springer,” InciWeb, 01-Oct-2012.
[77] U.S. Forest Service, “Duckett Fire,” InciWeb, 01-Oct-2012.
[78] U.S. Forest Service, “High Park Fire,” InciWeb, 24-Oct-2012.
[79] U.S. Forest Service, “Button Rock Fire,” InciWeb, 05-Jun-2012.
[80] Uncompahgre Field Office, Bureau of Land Management, “Sunrise Mine,” InciWeb, 07-Jun-2012.
[81] U.S. Forest Service, “Hewlett Fire,” InciWeb, 26-Jul-2012.
[82] U.S. Forest Service, “Little Sand,” InciWeb, 09-Jul-2012.
[83] Colorado State Forest Service, “Lower North Fork,” InciWeb, 02-Apr-2012.
[84] C. Minshew and D. J. Schneider, “2012 Colorado wildfires - at a glance,” Denver Post, Denver, Colorado, United States, 15-Jun-2013.
[85] E. Aramaki, S. Maskawa, and M. Morita, “Twitter catches the flu: detecting influenza epidemics using Twitter,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 2011, pp. 1568–1576.

Appendices

A. Selectors used for the first Hurricane Sandy Dataset

34 keywords:

baltraffic Ctsandy damage dcsandy dctraffic desandy evacuation flood Florida frankenstorm hurricane linedown mdtraffic njsandy njtraffic Nyctraffic nysandy outage outpage power sandy sandyde Shelter storm stormed surge tree treedown tropical Vatraffic water #hamptons #northfork #nofo

72 users:

1PolicePlaza 511NY 511NYC 511PAStatewide AACO_OEM BelAirVolFireCo breakingstorm CapitalAlert ccpa_net_dps911 CecilCountyDES ChesterfieldVa CityofVaBeach CraigatFEMA DarienTimes DC_HSEMA dcfireems DCPoliceDept DDOTDC Delaware delaware_gov

delawareonline DEStormInfo drgridlock fairfaxcounty fairfaxctycert FairfieldSun FEMAregion3 GovernorMarkell HenricoOEM HHCnyc HoCoGov HRD_AOML_NOAA HumanityRoad HurricaneAlerts JSHurricaneNews LoudounFire mayorsCAU MONOCEMS MontgomeryCoMD NEMA_DC NHC_Atlantic NHC_Surge nhregister nj1015 njdotcom NJNewsCommons Njtraffic NJTrafficAlerts NotifyNYC NYC_DOT Nycgov NYCTrafficCheck NYCWater nyvost PGCountyOEM postmetrogirl ReadyArlington RedCrossNCR starledger sussex_pio TheSJTimes thewatchmantwit TotalTrafficNYC tvnooz twc_hurricane usNWSgov VaDOT VEMAWeb wmata WTOP WTOPtraffic wunderground

One geographic bounding box: the rectangular region bounded at the southwest corner by (29.571, -76.289) and at the northeast corner by (40.555, -75.212).

B. Selectors used for the second Hurricane Sandy Dataset

10 keywords:

DC Frankenstorm NOVA NYC PGH Pittsburgh Power Sandy Sandyinphilly Wind

C. Selectors used for the 2012 Colorado Wildfire dataset

9 keywords:

colorado wildfire #cofire #cofires #colorado #wildfire #highparkfire #lowernorthforkfire #pyramidmtnfire #waldocanyonfire #waldofire
