
Labeling, Analyzing, and Simulating Tweets Produced in Disasters

Peter Maciunas Landwehr
Computation, Organizations, and Society
Institute for Software Research
School of Computer Science
Carnegie Mellon University
[email protected]

September 30, 2013

Abstract

Relief organizations use social media to stay abreast of victims' needs during disaster and to help in planning relief efforts. Organizations hire dedicated teams to find posts that contain substantive, actionable information. Simulation and training exercises play important roles in preparing for disaster; tools like HAZUS-MH and FlamMap help to predict the physical damages and costs that will result when hurricanes and wildfires strike. Yet few if any disaster simulations incorporate the ways that a disaster can cause an outpouring of comment and useful information on the web. In disasters such as the 2010 Haiti Earthquake, tweets and SMS messages have been translated and coded using crowdsourcing platforms to divide up labor. Machine learning has been used to parse tweets for keywords and then to intuit earthquake epicenters from posters' physical positions. Researchers have developed several different coding schemes for tweets to identify different types of useful content. While unskilled coders can apply some schemes, others require nuanced, subjective judgments. I propose to address the problems posed by the dearth of training simulations and the difficulties of applying salient codings to tweets by developing a novel simulation of social media in disaster, experimenting with using crowdsourcing to apply a set of sophisticated labels, and using the coded data to train a machine learning classifier. The new simulation will produce sequences of "pseudo-tweets" that consist of content labels and a collection of common features, such as hashtags and retweet indicators, as output. It will be validated against longitudinal analyses of tweets produced during Hurricane Sandy and the 2012 Colorado Wildfires, and can be incorporated into future disaster preparation tasks. Prior to analysis, a subset of tweets from each disaster will be coded with a sophisticated labeling scheme developed by Vieweg, using both crowd workers and trained linguistics students, with the results of each group's coding compared against the other. The gold-standard data will be used to train a machine learning classifier for applying the codings in the future.

Committee:
Dr. Kathleen M. Carley (Chair)
Dr. Jason Hong
Dr. Jürgen Pfeffer
Dr. Sarah Vieweg

1 Introduction: Problem Domain and Research Field

Social media platforms are an established part of society. Their users continue to post content even during the immediate onset of disaster. Relief organizations have begun to see these posts (and the tendencies driving these posting practices) as something that can be leveraged to support their various functions in disaster. During Hurricane Sandy, for example, many people afflicted by the storm posted continuous updates to Twitter [1]. Individuals living in the area afflicted by the 2008 spill of coal ash into the Tennessee River attempted to use Twitter to increase public awareness of the event [2]. Relief groups have correspondingly begun turning to Twitter as a source of additional information about disasters as they transpire [3], [4].

The central difficulty of working with social media is the large volume of noise present on the platform. While good search term choice can help to limit the amount of noise present in the system, locating relevant terms remains a difficult problem. One approach is that of Sakaki et al. [5], who used the distribution of different earthquake-related keywords across Japan to predict the epicenters of earthquakes. Similarly, Asur and Huberman have used the relative frequencies of mentions of particular movies on Twitter to predict weekend box office performance [6]. Another group of researchers has looked at different ways that tweets can be categorized and grouped on the assumption that they may contain additional useful information. These categorizations have ranged from the moderately interpreted (Naaman et al.'s "sharing information") to the very subjective but likely useful (Verma et al.'s "situational awareness" label) [7], [8]. Similarly, both researchers and the private sector are building tools to help relief groups rapidly filter social media data for relevant information [9]–[11].

While relief groups are working with Twitter data during real disasters in order to understand the situation on the ground, simulation has long been understood to be a critical component of preparing for disasters [12]. Yet despite knowing that social media will play a role in any disaster response, little attention has been given to reproducing the form and content of the relief messages produced on social media. One exception was a live simulation of a disaster carried out at Arizona State University in 2011, in which students were recruited to produce proxy tweets [13]. However, the simulation did not incorporate mechanisms to guarantee that students would produce tweets in a manner resembling the true distribution of tweets in disaster.

In this thesis, I will seek to address the dearth of training simulations for social media by developing a simulation that produces pseudo-tweets, each of which contains no text but includes a set of representative tweet features and a particular mix of category labels. In order to create data that can be used with the simulation, I will categorize a subset of two corpora of tweets using labels developed by Vieweg. These labeled tweets will be used as training data for a classifier to automatically code all tweets in the corpora that were not labeled in the first phase. Once categorized, all of the labeled tweets will be used as the basis for a general analysis of the two disasters as represented on Twitter. This analysis and the distributions of the different labels will then be used to give the simulation a basis in reality and will serve as the basis for a face validation of its results.

2 Research Questions and Contributions

This thesis will make four contributions, two secondary and two primary. Three of these contributions (crowdsourcing a sophisticated tweet coding scheme, developing a new classifier for tweets, and conducting a longitudinal analysis of tweets made during disasters) are intended to support the fourth goal, building a simulation of the tweets produced over the course of two different kinds of disasters. The contributions can each be thought of as addressing one of three research questions:

• How can we inexpensively and rapidly classify tweets so that relief workers can use them?
• How are disasters represented in the content of social media?
• How can we usefully reproduce the content produced on social media during disasters?


With these research questions as a basis, the single thesis statement for this work would be: "In order to aid first responders in dealing with Twitter data gleaned during disaster, I propose to experiment with how to efficiently and meaningfully label tweets, to analyze how tweets produced in disaster relate to the events occurring in the disaster, and to develop representative simulations of the tweets produced in a disaster situation." Below, the four contributions are discussed in more detail.

2.1 Crowdsourcing a Sophisticated Coding Scheme

When creating her own gold-standard data, Vieweg trained two linguistics graduate students in the nuances of her coding scheme and acted as an arbitrator for their decisions [14]. This process took time and effort, and cannot be applied to live data collected during a real disaster. As Vieweg herself mapped out how tweets with particular codes could contain data of interest to relief groups, it would be useful if her coding could be applied during the course of a disaster as data arrives. Faster labeling would play an important role in getting additional information to authorities sooner, delivering aid more quickly, and reducing the costs of providing relief. In this thesis, I will attempt to use workers on a crowdsourcing platform to apply her coding scheme to tweets. Crowd workers have already played crucial roles coding data in other disasters, notably in the 2010 Haiti Earthquake [15]. The central challenge of this task will be converting Vieweg's instructions into training that can be used to guarantee that crowd workers will perform quality coding despite lacking the same background training.

2.2 Developing a New Classifier for Tweets

While successfully crowdsourcing the labeling of tweets will decrease the time and expense of coding, developing a functioning machine learning classifier for applying those same labels would have a more dramatic effect, whether used alone or in tandem with crowdsourcing. Vieweg proposes and tests one theoretical model for a classifier in her thesis, and Verma et al. developed a very effective classifier for "situational awareness", the top-level category in Vieweg's hierarchy of labels [8]. A new classifier that can effectively apply Vieweg's labels and categories to tweets that have been identified as possessing situational awareness will further decrease the cost of processing tweets while also pointing the way toward a system for on-line coding of tweets during a disaster. Just as Vieweg's labels exist in a hierarchy, it is likely that classification will be done using a set of classifiers to label tweets as containing situational awareness information, one or more categories, and a mix of category labels. Verma et al. were highly successful at training a situational awareness classifier for disaster data, and I hope to build on their success [8]. Given the number of labels developed by Vieweg and their uneven distribution in different disasters, it is conceivable that it won't be possible to train a classifier for every label, but rather just for those present in my data. In that case, I will use crowd workers or linguistics students to code more tweets and will define success around classifying only the labels present in the data.

2.3 Longitudinal Analysis of Salient Tweet Features During Two Disasters

The goal of the longitudinal analysis is to characterize how the frequencies of particular categories of tweets, as well as their salient features, change in distribution over the course of a disaster and in response to particular events during a disaster. It is known that individuals who experience power outages, flooding, and other disaster-related events often comment on their existence immediately; this fact has been leveraged by Sakaki et al. to approximate the locations of earthquake epicenters in Japan and has been compared with US Geological Survey reporting by Earle et al. [5], [16]. De Longueville et al. had mixed success attempting to approximate a forest fire's extent using tweets [17]. However, it is unclear how the patterns and categories of tweets reporting particular types of suffering during a disaster change and multiply as the disaster continues, ends, and moves on into its aftermath. Some similar work has occurred: Sato et al. discuss initial approaches to linguistically analyzing the tweets collected in a corpus during a disaster, while Kongthon et al. looked at the changes in types of tweets sent over time during a 2011 Thailand flood [18], [19]. (Limited information about coding assignment or validity is incorporated in the latter study.) The aforementioned work by De Longueville et al. compares the frequency of collected tweets with critical events in the forest fire's growth and decay. Vieweg's categories, however, provide a precise lens for viewing disasters that has not been used in this type of study, as well as a framing for connecting the needs expressed on Twitter to relief services.

2.4 Simulating the Salient Features of Tweets Produced During a Disaster

Simulations play an important role in disaster preparation [12]. Relief groups can use simulation exercises to help train for the approximate circumstances of a given disaster. In general, such simulations are best known for capturing the costs and types of physical damage that a disaster will wreak [20]. Few attempt to capture how individuals use social media to communicate their conditions in a disaster. My simulation will address this problem by focusing on reproducing the proximate contents of tweets that occur during different disasters, providing an approximation of the messages that relief workers will be likely to encounter. The production of tweets will be driven by user specification of the size and start of the disaster event. The simulation will then spawn a number of "pseudo-tweets" based on particular events determined to have occurred during the disaster. These "pseudo-tweets" will contain a variety of salient features seen in tweets, such as hashtags, retweet indication, mention counts, and type of user. They will be representative of those tweets using the most common, surface-level hashtags. For a training exercise, pseudo-tweets would be generated in advance, archived, and spooled out to individuals in charge of monitoring social media. The response team would then use the raw tweets to search for particular hashtags or users, look at what they say, and report to others simulating the role of relief groups. These responders would correspondingly resolve the reported incidents.

3 Background and Related Work

My literature review is divided into several distinct segments. In the first, I attempt to provide a characterization of Twitter: usage numbers, demographics, and some research findings about how the service tends to be used. In the second, I examine some of the literature on human behavior in disaster. In the third, I look at the role new media has played in disaster, with a specific emphasis on Twitter. In the fourth and fifth, I look at how simulations have contributed to disaster response and how they have been used to understand and reproduce information exchange and learning.

3.1 Twitter outside of Disasters

Twitter is one of the most well-known and researched microblogging platforms in the world. The service allows users to post 140-character messages that can be read by the entire public, read only by the other users that one has decided to "follow", or sent directly to a particular user who is following the sender. In addition to posting, users can follow others, seeing everything that they have chosen to post. All tweets can be searched using particular keywords or phrases, and the site actively maintains a list of "trending topics" indicating what content is the most important at the moment. Breaking through to the trending topics list is often seen as an indicator of having attained viral success. Users can also subdivide those who they follow into particular, dedicated lists in order to create dedicated channels for certain types of tweets.

Developers can download tweets from the so-called "firehose" by working with one of several data brokers with which the company has partnered. Twitter also grants developers free access to the "streaming API", which can be used to download up to 1% of all tweets at a given time. Morstatter et al. compared samples of tweets extracted from Twitter using the streaming API and the firehose when searching for tweets related to the Arab Spring [21]. They found that the streaming API failed to capture a representative distribution of tweets relative to the true distribution, hypothesizing that this was due to a global decrease in the total number of tweets: because the total number of tweets dropped, Twitter decreased the number of samples in order to remain at the 1% level.

Launched in 2006, Twitter has largely received positive media coverage [22]. Precise numbers about the size of the user base are difficult to ascertain. In March of 2011, Twitter reported that an average of 460,000 accounts were created each day during the past month [23]. An average of 140 million tweets were sent each day. In July of 2012, Semiocast projected that Twitter had passed 500 million total accounts in June [24]. In January of the same year, the company had calculated that only 27% of all accounts were actively updated, and that 141.8 million of their 517 million account sample were in the US [25]. This would suggest that approximately 38.2 million US accounts are active. For comparison, in 2012 GlobalWebIndex reported that there were 22.9 million active US Twitter users; they also reported that Twitter was more popular in China and India than in the U.S., something not seen in Semiocast's data [26]. In August of this year, Twitter reported normally seeing an average of 5,700 tweets per second [27]. According to the Pew Internet & American Life Project, in 2012 15% of online Americans used Twitter [28]. 26% of online 18-29 year olds used Twitter, as did 28% of online non-Hispanic Blacks. Both of these demographics have statistically significantly larger membership than other age brackets and races. Similarly, urban and suburban dwellers are statistically significantly more likely to use Twitter than are Americans living in rural areas.

In 2007, Java et al. performed a network analysis of the relationships between 87,897 users, all of whom had posted between April 1 and May 30 [29]. They fit the network to a power law distribution, loosely classified users based on network position as "information sources", "friends", and "information seekers", and argued that mutual following could be used to locate communities of users. In 2009, Kwak et al. reported crawling the entire Twittersphere, which at the time consisted of 41.7 million users [30]. The researchers found that the average path length between different users was only 4.12, which was surprisingly short given that only 22.1% of ties are reciprocated; they hypothesize that this may have to do with Twitter's nature as a broadcast medium (where users follow those they are unlikely to know in real life) rather than a network of social relations. Also in 2009, Naaman et al. analyzed 3,379 messages from 350 users and developed their own content-based categorization of users [7]. The researchers described one group of users as "me-formers" and another group as "informers": users who tend to talk about themselves and users who tend to link to other information. Bakshy et al. looked at the likelihood of seeing a particular URL propagated by a user's followers as a measure of influence [31]. While they were able to identify users who were influencers, the researchers were unable to locate any particular features that would guarantee that a particular influencer's tweet would result in a cascade of retweets. As such, they propose guaranteeing a tweet's circulation by getting it mentioned by a large number of un-influential users first, who may eventually cause an outbreak leading to a true cascade.

3.2 People and Organizations in Disasters

While popular media often depicts individuals in disasters as engaging in panicked flight behavior, this is incorrect: people in disaster rarely panic. Rather, they make rational decisions intended to guarantee their own safety and the safety of others. As described in 1970 by Dynes, "Panic is infrequent and does not occur on a mass scale. Disaster victims act positively, not irrationally or passively. Mutual help and self-help are frequent. Psychological disturbances do not render the impacted population helpless. Much of the individual rescue work is done by the victims who do not wait to be told what to do." [32] Indeed, the individuals most likely to rely on state services for relief are those who don't have other social contacts upon whom they can rely. It is the individuals in the area of the disaster who end up doing much of the early relief work, before aid agencies can arrive ([33] citing [34]). According to Dynes, assessing the damage that has been done is the "most significant initial task" facing the community.

Relief organizations active in a disaster area are intensely dependent on recovering new information. They often function in an ad-hoc manner, improvising the tasks needed to perform well [35]. The many organizations active on-site at any one time may not have any sort of shared situational awareness, which can make functioning in a hierarchy particularly difficult. In the aftermath of the 2010 Haiti earthquake, the various relief groups on the ground reported having difficulty working with each other [4]. Every organization appeared to bring its own software and data, and it was difficult for any one group to take charge.

3.3 New Media and Twitter's Role in Disasters

When individuals turn to new media in disaster, they do so to address established needs. For example, Shklovski et al. (2008) describe how a group of Californians felt that local media was failing to provide them with the information they needed about wildfire locations; to solve the problem, the Californians set up online forums where they could each contribute and report information [36]. Similarly, in the immediate aftermath of Hurricane Katrina, musicians in New Orleans began to adopt new technologies such as SMS texting and the web in order to stay in touch [37]. In both cases, current technologies were being leveraged in service of traditional needs to maintain safety and community. After the 2010 Haiti Earthquake, victims used SMS messaging to report their needs using the 4636 SMS short code; messages were aggregated and translated by local volunteers organized via a crowdsourcing platform [38]. Victims' needs weren't unconventional, but the combination of SMS messaging and crowdsourcing was.

Twitter fits into this pattern of being used by individuals to address established human needs. When Sarcevic et al. studied tweets by medical relief groups during the 2010 Haiti earthquake, they identified a behavior they referred to as "beaconing": these medical workers would post their needs and their current status to Twitter, as opposed to using a more directed or private channel [39]. Similarly, the media has covered accounts of how individuals in disaster have used Twitter as a platform to stay informed about others in the afflicted area [1], [40]. Mendoza et al., when looking at a small pool of verified true and false pieces of information propagated during the 2010 Chilean earthquake, found that false statements were often contested and rejected by the crowd while true ones were rarely contradicted [41]. That said, Twitter users failed to dispute general rumors about looting in Santiago, suggesting that it may be difficult to trust Twitter when it comes to vague, general rumors rather than specific sightings. Relatedly, Pfeffer et al. have looked at the particular risks posed by the periodic explosive growth of particular online stories in tense situations, and have attempted to characterize the features that make such growth possible [42].

Even as individuals reach out and describe what they experience in different crisis situations, a variety of research looking at how Twitter is used in disaster has found that only a small percentage of tweets contain any form of actionable data. Researchers studying tweets produced during fires, for example, have found that only a small percentage contain directly actionable information [43]–[45]. (Sinnapan et al., while finding that only 5% of tweets collected during Australia's Black Saturday fires contained directly actionable information, report that 22% of tweets contained some form of useful data.) Getting to this nugget of actionable information remains a core challenge of working with Twitter data. In the 2011 Shadow Lake Fire, for example, the Portland branch of the National Incident Management Organization (NIMO) used a dedicated group of volunteers known as a Virtual Operations Support Team (VOST) to keep track of useful social media activity [3]. Kris Eriksen, the head of the Portland branch, cited an example of the VOST finding a blog by a local explaining how fire trucks were badly routed through back roads, saying that such information seeking was "EXACTLY what I need a [VOST] to do." When looking at tweets collected during the Red River flood, Starbird et al. found that original tweets containing information about the flood made up less than 10% of the sample, but that 80% of those tweets were made by locals [45]. It is reasonable to contend that locality can at least be used as a rough first test for utility. Starbird et al. attempted to leverage this by training a support vector machine to identify members of Occupy Wall Street who were tweeting locally, albeit only achieving 67.9% accuracy [46]. Verma et al. proved more successful when attempting to solve the problem by using a trained Maximum Entropy classifier to identify tweets that exhibited "situational awareness" (SA), that is, tweets containing tactical, actionable information, in four different disasters and on a pseudo-disaster made of a uniform mix of random tweets taken from the other data. Using a combination of bigram features, unigram features, and classifiers for tone, subjectivity, and register, the researchers managed to achieve above 80% accuracy in labeling tweets as possessing or failing to possess SA. (Individual rates are not included.)

Vieweg, who contributed to the work with Verma, developed a set of 35 non-exclusive labels for the different kinds of data that can be found in SA tweets, working with both data and the literature on disasters [14]. Each label belongs to one of three categories identifying relevance to the social, built, or physical environment (S, B, or P; 24, 4, and 7 labels respectively), and is supported by a nuanced description. Working with VerbNet, a class-based verb lexicon, she found that the nine different classes of verbs described in VerbNet could be exclusively matched to one of the three category labels. In toto, of the tweets known to contain social, built, and physical environment information, 13.5%, 15.0%, and 30.3% respectively contained the referenced verbs.

3.4 Simulating Disaster

Two important reasons for simulating disaster situations are to estimate the amount of damage that a disaster will cause and to prepare for the coming scale of the disaster. As such, established simulation models for disaster tend to focus on calculating the physical damage that a particular disaster will do to an area, the dollar cost of such damage, and the amount of life that might be lost. That said, these models differ strongly in their internals because of the significantly different properties of the natural phenomena that they represent.

The US Forest Service's Missoula Research Station, for example, has released a collection of different models that take as input natural parameters including terrain, fuel, and weather and produce projections of how wildfires can be expected to grow. FARSITE and FlamMap are two models developed by this group that have been used by the Forest Service to find areas that may be damaged in controlled burns so that the burns can be kept from spreading, and to calculate the potential damage a wildfire might cause [47]–[49].


FEMA distributes HAZUS-MH, a free, comprehensive, customizable physical simulation of the damage inflicted by earthquakes, floods, and hurricanes [20], [50]. The model can be used with general default settings, or can take as input detailed models of the buildings in the region to be simulated, local soil data, economic information, and other elements; it estimates damages to human life, physical structures, and the creation of debris. In contrast, Florida's Public Hurricane Loss Model (PHLM) is an actuarial model primarily intended for estimating insurance losses [51]. The model uses historical records of insurance losses in Florida, records of hurricanes from previous years, and data about the ways in which particular structures take physical damage.

Both HAZUS and PHLM are composed of multiple individual models, used in tandem to predict the amount of damage that areas will see. These component damage models have been validated by comparisons with physical damage data collected from a number of hurricanes. Researchers also periodically validate HAZUS by comparing it with other disasters. For example, Remo and Pinter found that the HAZUS earthquake model significantly overestimated damage caused by the 2005 Mt. Carmel earthquake [52].

3.5 Simulating Information Transmission

There is relatively little work on specifically simulating the production of content in the digital medium. That said, a variety of research has been done looking at how we can reproduce the network structures underlying digital communities. Using data from a variety of communities, Leskovec et al. have defined a "forest fire" model that reproduces a variety of different online networks of different sizes, as well as thoroughly reviewing much of the literature on generative network models [53]. Similarly, a variety of work has been done trying to understand different activities in online media, sometimes incorporating simulation. Romero et al., for instance, developed a classification for different types of hashtags and simulated hashtag propagation [54].

My simulation will use loose approximations of individual users to produce tweets. It fits into the longer tradition of using agent-based models as a way of representing individuals' decisions. This connects back to a variety of agent-based models such as the Schelling segregation model and Axelrod's model of the dissemination of culture. These both use simplified representations of individuals, modeled as having only one facet to their personalities and as being fixed in a particular location within a grid or on a number line [55], [56]. In contrast, in previous research I have worked with other researchers to use the CONSTRUCT modeling system to situate a number of tribal actors within different relationships to approximate changes of belief [57], [58]. Each of these models provides a limited set of characteristic features for individuals that can be used to describe their different beliefs. While useful in certain circumstances, in this particular simulation it is unnecessary to suppose any kind of relational structure between the represented individuals.

4 Data

For the analysis portion of this thesis, I will be reviewing collections of tweets sent during two 2012 disasters: those messages sent during Hurricane Sandy, and those sent during the 2012 Colorado Wildfires. Both sets of tweets were collected using the TweetTracker software developed at Arizona State University [9]. TweetTracker provides a GUI-based interface to the Twitter API, allowing users to specify keywords, particular users, or geographic bounding boxes to constrain their searches. Because our tweets were collected using keyword-based searches, they should only be considered representative of those tweets likely to be selected by relief workers or other actors attempting to use Twitter to find disaster-related tweets during a particular disaster event. Similarly, the data were collected using the Twitter API, which only includes 1% of all tweets, and so are not the totality of tweets using these particular keywords. The data do qualify as a representative sample of the sorts of tweets that a user searching for tweets using disaster-related keywords might draw.

In addition to describing the data, I also provide some contextualizing information about both Hurricane Sandy and the 2012 Colorado Wildfires. The context of each disaster is important to understanding the tweets produced.

4.1 Hurricane Sandy

According to the National Hurricane Center, Sandy originated from a tropical wave that departed Africa on October 11, 2012, moving west [59]. At approximately 0700 EST on October 24, 80 nautical miles south of Kingston, Jamaica, it finished evolving into a hurricane. It moved north, making landfall in Jamaica at Bull Bay at 1400 EST, continued up through Cuba and the Bahamas, and then moved northeast towards the East Coast. While losing speed, by 0700 EST on October 27 it had once again gained hurricane force but had a particularly unusual configuration: "Reconnaissance data indicated that the radius of maximum winds was very large, over 100 [nautical miles], and the strongest winds were located in the western semicircle of the cyclone. In addition, satellite, surface and dropsonde data showed that a warm front was forming a few hundred miles from the center in the northeast quadrant, with another weak stationary boundary to the northwest of the center." On October 26 and 27, seven states and the District of Columbia declared states of emergency [60]. Moving up the east coast, Sandy reached peak intensity at about 0700 EST on October 29, while its center was 45 nautical miles southeast of Atlantic City. It made landfall near Brigantine, New Jersey at 1830 EST, then pushed west-northwest and eventually broke up over Ohio after 0700 EST on October 31.

There was significant warning that Sandy would strike the US; stories covering the likelihood of a "perfect storm" hitting the US began appearing as early as October 25. Leading up to the disaster, Twitter launched a dedicated "hashtag page" that presented both tweets about Sandy and important Twitter accounts to follow in order to get information related to the disaster [40]. The National Hurricane Center and other organizations advertised that people could receive updates from them by checking Twitter [61]. In the storm's immediate aftermath, various news outlets reported on how Twitter had played an important role in keeping people apprised of the hurricane's progress [1], [62]. The media characterized the flow of tweets as "riveting", a way of seeing what was happening during the storm on a moment-to-moment basis.

Sandy has been estimated to have directly caused 72 deaths in the U.S. (75 outside of it) and indirectly caused 87 more. According to the National Hurricane Center, Sandy inflicted approximately 50 billion dollars of property damage and left approximately 8.5 million people without power, some for months after the disaster. (On November 1, the number was 4.8 million.) New York Governor Andrew Cuomo estimated damage to his state at $32.8 billion, while New Jersey Governor Chris Christie estimated damage worth $36.8 billion [63]. As late as August 15 of this year, news stories have continued to appear about the slowness of the recovery process [64].

4.2 Tweets collected during Hurricane Sandy

I will be working with two sets of tweets collected during Hurricane Sandy, between which there will be some degree of overlap. Both sets of tweets were collected using the TweetTracker software developed at Arizona State University.


The first set of Sandy data consists of 5,634,311 tweets by 3,036,353 users collected between October 25, 2012, 3:42 PM and November 3, 2012, 11:59 PM. Tweets were acquired if they used any of 34 keywords or were made by any of 72 users. Tweets were also briefly acquired if they came from a region in the bounding box with a southwest corner at (29.571, -76.289) and a northeast corner at (40.555, -75.212) [65]. The bounding box was discarded soon after collection began because it was acquiring too much noise. The different selectors were specified by Dr. Rebecca Goolsby based on the recommendations of different subject matter experts with whom she works as a program coordinator; the geographic region was chosen based on predictions about where Sandy would be likely to strike. The chosen keywords were intended to pull out mentions of the storm, conditions in the regions it was afflicting, and different issues that could be caused by the hurricane. The followed users include official accounts for local newspapers, different government agencies that play roles in disasters, and NGOs that provide disaster relief. A complete list of the selectors used and a diagram of the bounded region are present in Appendix A.

The second set of tweets consists of 4,288,878 tweets by 2,514,529 users, collected between October 30, 2012, 1:13 AM and November 1, 2012, 8:10 PM [66]. Tweets were collected if they used any of 10 keywords, chosen by CMU graduate students to capture both reports of events related to Sandy and events occurring in regions afflicted by Sandy. The decisions about particular keywords were made using surface-level criteria (e.g. the name of the place or the hurricane). All of the keywords are listed in Appendix B.

Combined, and omitting six stray tweets captured before 8:00 PM on the first day, there are 8,529,999 unique tweets in the data, distributed over the ten-day period as shown in Figure 1; the most tweets recorded in any one hour was 119,603. Peaks occurred on October 30 and 31, separated by a brief, sharp trough in the early morning of the 31st; an average of 34.34% of the tweets contain hashtags, and only 3.8% have latitudes and longitudes. Further, the distribution of these geocoded tweets is skewed towards the beginning of the hurricane, as shown in Figure 2. This is likely an artifact of the bounding box mentioned earlier, but the scale of the impact on the sample is surprising.


Figure 1 A plot of all the tweets collected during Hurricane Sandy, aggregated by hour



Figure 2 Proportion of collected tweets with latitudes and longitudes.
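For reference, the hourly aggregates plotted in Figures 1 and 2 can be computed with a few lines of code. The sketch below is illustrative only: the file name and the column names (created_at, text, geo) are assumptions about how a TweetTracker export might be structured, not the actual format of my data.

# Minimal sketch (not the thesis pipeline) of the hourly aggregation behind
# Figures 1 and 2. File and column names are assumed for illustration.
import pandas as pd

tweets = pd.read_csv("sandy_combined.csv", parse_dates=["created_at"])
tweets = tweets.set_index("created_at").sort_index()

hourly_total = tweets["text"].resample("1H").size()                    # tweets per hour
hashtag_rate = (tweets["text"].str.contains("#", na=False)
                .astype(float).resample("1H").mean())                  # share with hashtags
geo_rate = tweets["geo"].notna().astype(float).resample("1H").mean()   # share with coordinates

print("Busiest hour:", hourly_total.idxmax(), int(hourly_total.max()))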

4.3 The 2012 Colorado Wildfires

Colorado Governor John Hickenlooper called the summer of 2012 "the worst fire season in the history of Colorado." Numerous fires broke out during the summer, forcing individuals to evacuate from their homes; the Waldo Canyon Fire forced approximately 32,000 people to flee Colorado Springs [67]. Between late March and the end of July, at least 16 different fires were logged in the Inciweb interagency disaster reporting system [68]–[83]. The Denver Post noted 12 different fires [84]. According to the Post, only four of those fires (High Park, Waldo Canyon, Last Chance, and Lower North Fork) were responsible for destroying homes, razing 637 in total. All 12 together burned more than 213,000 acres. Figure 3 shows the overlapping time periods of the different fires, each of which required teams of firefighters to contain.


Figure 3 Durations of 16 large forest fires in Colorado during the summer of 2012. The majority of the collected tweets discuss the Waldo Canyon, Lower North Fork, and High Park Fires. (Rows 1, 6, and 12.)

4.4 Tweets collected during the 2012 Colorado Wildfires

Tweets from the Colorado Wildfires were collected by Shamanth Kumar at Arizona State University using a set of nine keywords intended to capture general information about the ongoing disaster. A total of 11,209 tweets by 6,246 users were captured between March 29, 10:45 PM and July 31, 2:41 PM; 9,000 tweets (over 80%) were captured between June 28 and July 2, the weekend of Colorado Springs' evacuation for the Waldo Canyon Fire. (See Figure 4.) The keywords used to collect the wildfire data are listed in Appendix C. They refer to the wildfires generally ("#colorado #fires") but make specific reference to the Waldo Canyon, High Park, and Lower North Fork fires. During the peak capture period, 6,435 captured tweets contained hashtags; the top five hashtags collected over the entire time period are shown in Table 1. While the most commonly used hashtag relates to the Waldo Canyon Fire, the second most common represents the Lower North Fork Fire. While the latter is dwarfed by the former, the data does contain large numbers of tweets from two forest fires.

Hashtag (lowercased)    Tweets using hashtag
waldocanyonfire         5,300
lowernorthforkfire      782
highparkfire            405
cofire                  316
cofires                 212

Table 1 The five most frequently used hashtags in the wildfire data.
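Counts like those in Table 1 can be generated by straightforward hashtag tallying. The snippet below is a minimal illustration of that kind of count, not the exact procedure used to build the table; the regular expression and the sample texts are placeholders.

# Illustrative sketch: counting lowercased hashtags across a list of tweet texts.
import re
from collections import Counter

def top_hashtags(texts, n=5):
    counts = Counter()
    for text in texts:
        for tag in re.findall(r"#(\w+)", text):
            counts[tag.lower()] += 1
    return counts.most_common(n)

sample = ["Evacuations near #WaldoCanyonFire", "Smoke from the #waldocanyonfire again"]
print(top_hashtags(sample))  # [('waldocanyonfire', 2)]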



Figure 4 A graph of the wildfire tweets collected, aggregated by day. 80.3% of tweets were collected between June 28 and July 2

4.5 Additional Tweets

I may supplement these collections of tweets with two additional sources. The first is any tweets that I collect from disasters that occur while I am carrying out my thesis research. Additional data about how people respond to hurricanes or wildfires would be useful in increasing the size of the disaster pool from which the simulation will be built.

The second source of additional tweets would be an unofficial, archived collection of tweets scraped by CMU researchers from the Twitter stream. These tweets were collected using the Twitter API without specifying any particular set of keywords. That is, they represent a general mix of the content being discussed at the time of a particular disaster, albeit not specifically about the disaster. Captured tweets that incidentally use the keywords specified in the different data sets, or other keywords related to Sandy or the wildfires, could safely be used to increase the quantity of data.

5 Methodology

In this section, I describe the methodology that I will use to go from a collection of raw, unclassified tweets to a simulation of the tweets produced during two different types of disasters. In general, the process spans coding a set of tweets with Vieweg-style codings to use as a gold standard for a classifier (and experimenting with different methods of creating a gold-standard coding), implementing a classifier for accurately labeling tweets with gold-standard codings, analyzing the tweets produced during disasters based on these codings, and developing a simulation based on the results of the analysis.

5.1 Creating a Gold-Standard for Tweets

In Vieweg's thesis, she worked with two linguistics students to code several thousand tweets from different disasters using a new coding scheme. In order to simplify the task, she and the raters applied the codings in three passes. In the first pass, the raters identified tweets containing information related to situational awareness (SA). In the second, they labeled SA tweets as relating to the social, built, or physical environment (S, B, and P). In the third pass, they applied a subset of at most five labels to each tweet, where particular labels exist underneath each category.


My coding process is derived from that used by Vieweg, but is also intended to test whether it is possible to use crowd workers instead of trained linguistics students to apply the same labels. In her research, she used approximately 1,000 SA tweets from each disaster. I will use 1,000 as a baseline size for my own collection of gold tweets from each disaster. This will likely be enough tweets to train a classifier that can identify the different, top-level categories of SA tweets, but may be inadequate for developing a classifier that can pick out the different sub-categories. If necessary, I will put additional tweets through the coding process in order to obtain more gold. The three passes that will be made over the collected data are described below. They should be understood as being split into two phases. In the first, both linguistics students and crowd workers will code 250 tweets. If the two groups perform similarly, in the second coding phase I will only use crowd workers for further coding; if the results differ, I will only use students. The "linguistics students" should be understood to be a pair of linguistics students that I will hire and train to code for each pass. I will break any "ties" that arise during the coding process. The "crowd workers" should be understood to be an undifferentiated mass of Americans who speak English. They receive training immediately before labeling a particular set of tweets during any coding pass. I currently anticipate that every job on the crowdsourcing platform will contain ten tweets.
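One way to operationalize "perform similarly" would be to compute an inter-rater agreement statistic between the two groups on the shared 250-tweet sample. The sketch below uses Cohen's kappa as one plausible choice; the statistic and the placeholder labels are illustrative assumptions rather than a fixed part of the protocol.

# Hypothetical sketch: comparing crowd-worker and linguistics-student codings on
# the shared sample using Cohen's kappa (one possible agreement measure).
from sklearn.metrics import cohen_kappa_score

# 1 = tweet judged to contain situational awareness, 0 = not (placeholder data).
student_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
crowd_labels   = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(student_labels, crowd_labels)
print(f"Cohen's kappa between groups: {kappa:.2f}")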

5.1.1 The Situational Awareness Pass

The first pass on the data will be devoted to identifying a set of tweets that exhibit situational awareness. This is a soft category, and one that may be difficult for crowd workers to apply effectively. I will instruct the two different worker groups about how to identify whether or not a particular tweet in a disaster contains situational awareness data, and ask them to code tweets accordingly. As stated above, I would like to obtain approximately 1,000 gold-standard tweets for each disaster. When Vieweg coded tweets from the 2010 Haitian Earthquake, she used a sample of 5,000 to successfully extract 1,000 SA tweets. Unlike Vieweg's sample, which represented the tweets sent by users who used disaster-related keywords at least three times, my data represents the sort of information found by relief groups when searching for keywords: information that is on the "surface" of a disaster. I therefore hypothesize that the ratio of SA to non-SA tweets will be fairly high, and I see the 5,000 number as an upper bound for estimating costs.

5.1.2 The Categorization Pass

In the second pass, tweets determined to relate to situational awareness are coded based on whether they contain information related to the social, physical, or built environment. These labels aren't exclusive, and a given tweet may possess any combination of them. Unlike the identification of situational awareness information, training crowd workers for this task should be relatively straightforward; Vieweg's thesis provides paragraph-length descriptions of the different categories that can be used as the basis of training. There is a reasonable threat that cryptic references to particular kinds of damage will be hard for crowd workers to interpret, so the training material may need to account for this. On the other hand, given the surface nature of our data, the appropriate categories may be readily apparent.

5.1.3 The Labeling Pass

In the third pass, the SA tweets are each assigned up to five labels based on the different categories to which they have been previously assigned. Each particular category is affiliated with a set of different potential labels: 4 for Built Environment, 7 for Physical Environment, and 24 for Social Environment. The linguistics students will be expected to be able to work with any number of these labels at any one time. The same cannot be said for the crowd workers, who may not have the same capabilities or patience, especially given the relatively low payment for performing the task. Tweets labeled exclusively with the Built and Physical Environment codes will pose no problem, as the number of codes is small enough to be handled by a single worker. If a tweet has been labeled as relating to the Social Environment or any combination of the different categories, the number of labels will be more than can be trusted to a single crowd worker.

There are several ways to design the labeling task to solve this problem. The most straightforward would be to present ordinal subsets of labels to crowd workers, have them select which ones apply, and then continue with run-off votes until a final set is chosen. For example, if a tweet was assigned the B and P labels, the combined 11 labels would be broken up as (1-5) and (6-11); a worker assigned to a subset can choose any number of the labels in the subset to assign to the tweet. If the workers choose more than five labels for a tweet, an additional crowd worker would be presented with all of the chosen answers and told to pick five. This method doesn't control for the possibility that the labels in a particular group may bias the worker to choose those labels, or the fact that, depending on the proclivities of the workers, the tie-breaking round may include just as many labels as were present from the start.

My current preference is to address this problem by breaking the appropriate group of labels into a number of random subsets, such that every label is guaranteed to appear in at least K subsets. One crowd worker will be assigned to each subset. I have written a preliminary program that generates such subsets. For the 11-label case with K = 2 and letting crowd workers choose between seven categories, it produces: (1, 2, 3, 7, 8, 9, 11), (3, 5, 6, 7, 8, 9, 10), (3, 4, 5, 6, 9, 10, 11), and (1, 2, 3, 4, 5, 9, 11). A crowd worker would be assigned to each of these subsets, and the majority of votes for each label would decide whether it should be applied. In the event of more than five labels receiving an equal number of votes, the set of tied labels can be split using the algorithm and voted on again; if the number of tied entries is 7 or fewer, a single crowd worker could be used to break the tie by selecting an aggregate top 5.

Combination of Categories    Number of Labels    Subsets when K = 3    Subsets when K = 5
B and P                      11                  5                     9
S                            24                  13                    21
S and B                      28                  16                    25
S and P                      31                  18                    29
S, B, and P                  37                  23                    35

Table 2 Combinations of category labels, the number of labels in the combination, and the number of random subsets needed to have K = 3 and 5 when putting seven labels in each combination.
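The sketch below illustrates one way a subset generator of this kind could work; it is a simplified stand-in for the preliminary program mentioned above, and its sampling strategy is an assumption rather than that program's actual logic.

# A minimal sketch (not the preliminary program itself) of generating random
# label subsets of a fixed size so that every label appears in at least K of
# the subsets presented to crowd workers.
import random

def make_subsets(labels, k=2, subset_size=7, seed=None):
    rng = random.Random(seed)
    coverage = {label: 0 for label in labels}
    subsets = []
    while any(count < k for count in coverage.values()):
        # Prefer labels that still need coverage, then pad with random others.
        needed = [l for l in labels if coverage[l] < k]
        rng.shuffle(needed)
        subset = needed[:subset_size]
        if len(subset) < subset_size:
            extras = [l for l in labels if l not in subset]
            subset += rng.sample(extras, subset_size - len(subset))
        for label in subset:
            coverage[label] += 1
        subsets.append(sorted(subset))
    return subsets

print(make_subsets(list(range(1, 12)), k=2))  # the 11-label (B and P) case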

5.2 Developing a Classifier

The particular classifier that I am planning to develop will use supervised learning to determine appropriate Vieweg-style labels for uncoded tweets. Just as the coding process has three tiers, so too can classification potentially be considered a three-tiered process. The first tier, labeling tweets as containing situational awareness information, has been addressed in prior work by Verma et al. The researchers were able to correctly classify tweets from several disasters between 83.5% and 88.8% of the time using a MaxEnt classifier, unigrams, part-of-speech tags, subjectivity coding, register, and tone; bigrams provided no useful improvement [8]. The last four features were themselves applied using classifiers trained during the course of the experiment. The second- and third-tier forms of classification (category labels and individual labels) lack any such clear prior. Vieweg demonstrated the utility of using VerbNet as a basis for applying category labels, but individual labels remained outside the scope of her work. My current plans for training include using at least unigram text features, subjectivity labels, POS tags, and verbs and VerbNet classes, though others may be added as needed. While I plan to experiment with several different supervised learning algorithms, I will make a point of using both MaxEnt, because of its prior success, and support vector machines, because of their established usage on Twitter data and track record [5], [46], [85].

As mentioned earlier, the 1,000-tweet baseline may be insufficient for training a useful classifier. In Vieweg's work, the distribution of tweet labels in each of her data sets was notably skewed. Fifteen labels are referenced in fewer than 50 tweets; thirteen are referenced in less than 1% of the collected tweets. Training a model with only fifty positive examples is likely untenable. While increasing the quantity of coded data may help to address this, it is also quite possible that in the end I will only be able to use the system to automatically apply a subset of the established labels. That said, if I can effectively use classifiers to find tweets that definitively do not possess any of the more frequent labels but do possess situational awareness, I can take these tweets back to the student coders or crowd workers to create additional training data for these underrepresented categories. To successfully complete the classification phase of this thesis, I will need to be able to apply all of Vieweg's labels that are used in a non-trivial number of tweets with a reasonable degree of success, which I would define as at least 80% accuracy. If I cannot achieve this end, then in order to carry out the analysis described in the next section I will return to using crowd workers or linguistics students to code an additional, large set of tweets and proceed from there.
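As a rough illustration of the first classification tier, the sketch below prototypes a situational awareness classifier over unigram features using scikit-learn; logistic regression stands in for MaxEnt, LinearSVC for a support vector machine, and the tiny training set is a placeholder rather than real coded data.

# Sketch of a first-tier situational awareness classifier using unigram features.
# Logistic regression is one maximum-entropy formulation; LinearSVC can be
# swapped in to test a support vector machine. Training data is a placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["power out on 5th ave, trees down", "good luck to everyone in nyc tonight",
         "shelter open at the high school gym", "can't believe this storm lol"]
labels = [1, 0, 1, 0]  # 1 = situational awareness, 0 = not (placeholder coding)

maxent = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression(max_iter=1000))
maxent.fit(texts, labels)
print(maxent.predict(["flooding reported near the boardwalk"]))

svm = make_pipeline(CountVectorizer(), LinearSVC())
svm.fit(texts, labels)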

5.3 Data Analysis

The purpose of this analysis will be to describe how different disasters are represented on social media over the course of their duration. I will use the classifier developed in the previous section to code all of the tweets that we have collected with the appropriate Vieweg-style labels. I will then compare how the frequencies of the different categories co-occur relative to the different critical events in the disaster. My analysis will also look at a variety of other features:

Hashtags. Hashtags are often understood as shorthand for a topic or theme in a tweet. Certain hashtags are often seen as indicative of a particular disaster (e.g. "#sandy" during Hurricane Sandy), but it can take time for those associated with a particular event to coalesce. By looking at the frequencies of hashtags in different tweets, I hope to determine how certain hashtags can be used to approximate situational awareness, and how deviations from expected frequencies can indicate new developments in the disaster.

Mentions of other users. The disaster research literature discusses how individuals try to stay in contact with their loved ones during disaster events. While the data doesn't allow us to look at the granular activity of individual Internet users, it does allow us to look at whether the collected surface tweets reflect this particular underlying social reality.

Retweets. One of the most common activities in the Twitter space is retweeting; users share messages posted by others, sometimes adding in their own thoughts and comments. While other researchers have looked at the likelihood that individuals will retweet rumors that have been established as true or false during disaster, it's unclear how much retweeting habits are altered by a disaster event. Do single tweets become heavily retweeted and dominate the disaster's story? Do individuals share personal information more often?

Location. I expect that tweets produced during a disaster will reflect the particular characteristics of the different afflicted regions from which the tweets derive. Only a fractional number of tweets are geocoded [24]. As such, a reasonable analysis of the impact of location may be impossible. Assuming that such an analysis can be conducted, however, I will attempt to correlate the particular location of each tweet with the disaster's epicenter, as well as any particular conditions afflicting the region from which the tweet is coming. (For instance, if power has been lost in a particular area in New York.) The findings of my analysis will be contingent on my ability to determine the presence of such conditions in the areas surrounding the disaster regions.


Time of day. Twitter usage patterns are known to vary based on the time of day. A disaster can have fairly dramatic impacts on the schedules kept by different individuals. Small events, like power outages or evacuation notices, can go undetected by the population if they occur at night. My analysis should attempt to differentiate between the impact of the particular time of day and that of the disaster's consequences on individuals' tweeting.

Events. Based on background reading about the disasters, as well as the codings and the other tweet features examined in this section, can we directly tie particular blocks of tweets to the occurrence of particular local events related to the disaster, such as local flooding, power outages, or official announcements?

Type of user. An individual's perception of a Twitter user's role can have an impact on their perception of the quality of that user's content. Both to account for this in simulation and to understand the impact of different users' roles on their behavior when using Twitter, I will code a number of individuals with role information and look at how it compares with their activity. In order to determine how posting frequency and role are correlated, I will first bucket all the accounts in each data set into quintiles based on their activity. I will then randomly sample a percentage of each bucket and code those users with information about their roles, as sketched below. At present, my plan is to use three role labels: individual, member of the media or media organ, and member of a relief organization or a relief organization itself.
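The bucketing and sampling step could look roughly like the following sketch; the 10% sampling rate and the input format (a map from user to tweet count) are illustrative assumptions rather than fixed choices.

# Illustrative sketch of bucketing accounts into activity quintiles and sampling
# a fixed fraction of each bucket for manual role coding.
import random

def sample_users_by_quintile(tweet_counts, fraction=0.10, seed=7):
    rng = random.Random(seed)
    users = sorted(tweet_counts, key=tweet_counts.get)   # least to most active
    quintile_size = max(1, len(users) // 5)
    sampled = {}
    for q in range(5):
        start = q * quintile_size
        bucket = users[start:start + quintile_size] if q < 4 else users[start:]
        k = max(1, int(len(bucket) * fraction))
        sampled[q + 1] = rng.sample(bucket, k)
    return sampled

counts = {f"user{i}": i for i in range(1, 101)}   # placeholder activity counts
for quintile, users in sample_users_by_quintile(counts).items():
    print(quintile, users)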

5.4 Creating a Simulation

The final component of the thesis will be two simulations, each based on the results of the analysis of one disaster and intended to facilitate disaster training and preparation. One simulation will produce the characteristic pseudo-tweets seen during a Colorado wildfire, the other the characteristic pseudo-tweets produced during a hurricane striking New York City. Each simulation will take as input a set of simplified parameters specifying the disaster’s severity (e.g. peak hurricane wind speed) and a total population of Twitter accounts. Currently, each pseudo-tweet is planned to include the fields listed below; a sketch of the corresponding data structure follows the list.

• Timestamp
• Type of user
• Vieweg-style categorization
• Generalized disaster-related hashtags (e.g. #disasterTag1, #disasterTag2)
• List of mentioned users
• User classification
• Location (if the tweet is determined to possess a geocode)
• A retweet indicator and, if set, a reference to the originating tweet
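
As a concrete illustration, the following is a minimal sketch of how a pseudo-tweet record might be represented, assuming Python dataclasses; the field names are illustrative and would be finalized during implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class PseudoTweet:
    """One simulated tweet: no text, only labels and representative features."""
    timestamp: datetime
    user_type: str                    # e.g. "individual", "media", "relief"
    vieweg_category: str              # Vieweg-style content label
    hashtags: List[str] = field(default_factory=list)   # e.g. ["#disasterTag1"]
    mentioned_users: List[str] = field(default_factory=list)
    user_classification: Optional[str] = None
    location: Optional[tuple] = None  # (lat, lon) if the tweet is geocoded
    retweet_of: Optional[int] = None  # id of the originating pseudo-tweet, if a retweet
```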

Currently, I plan to construct the simulation algorithm around the disaster’s intensity, the types of users, and the likelihoods of different discrete events occurring within the scope of the disaster. Such events would include power outages, government evacuation orders, and damage to property. During the analysis, I will gather information about these events by reading primary source materials related to each disaster.

When run, the simulation will begin by initializing a number of virtual Twitter users identified as individuals, media accounts, or relief NGOs; other types of users may be added if they emerge from the analysis of user roles. At every time period, there will be some probability that a particular disaster event will occur. Correspondingly, there will also be a probability that each user may emit a pseudo-tweet. This probability will be conditioned on the time since the start of the disaster, the time of day, the types of events seen during the disaster and the time since the most recent one, and the type of user.
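
A minimal sketch of this time-stepped loop appears below, assuming hourly steps and per-step probabilities for events and user tweeting; the probability function (tweet_probability) and its weights are placeholders, not the calibrated values that will come out of the analysis.

```python
import random

# Hypothetical base rates; real values would be estimated from the coded corpora.
EVENT_PROB_PER_STEP = 0.02
BASE_TWEET_PROB = {"individual": 0.01, "media": 0.05, "relief": 0.03}

def tweet_probability(user_type, hours_since_start, hour_of_day, hours_since_last_event):
    """Toy model: probability that a user emits a pseudo-tweet in one hourly step."""
    p = BASE_TWEET_PROB[user_type]
    p *= 0.3 if hour_of_day < 6 else 1.0             # overnight lull
    p *= 2.0 if hours_since_last_event < 2 else 1.0  # burst after a recent event
    p *= max(0.2, 1.0 - hours_since_start / 240.0)   # slow decay over ten days
    return min(p, 1.0)

def run(users, n_steps, seed=0):
    """users: list of user-type strings; returns (step, user index) pairs for emitted tweets."""
    rng = random.Random(seed)
    hours_since_last_event = 999
    emitted = []
    for step in range(n_steps):
        if rng.random() < EVENT_PROB_PER_STEP:       # a disaster-related event occurs
            hours_since_last_event = 0
        for i, user_type in enumerate(users):
            p = tweet_probability(user_type, step, step % 24, hours_since_last_event)
            if rng.random() < p:
                emitted.append((step, i))
        hours_since_last_event += 1
    return emitted

sample = run(["individual"] * 50 + ["media"] * 5 + ["relief"] * 5, n_steps=72)
```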

As noted, the simulation will take input parameters to determine the intensity of the disaster. Because of Hurricane Sandy’s exceptional strength, it will sit at the far end of the spectrum; pseudo-tweets produced by less intense hurricanes will have lower probabilities of occurring. The data from the Colorado wildfire season contains information from multiple fires, and so should be more readily scalable to different levels of fire threat. Additional data would still be welcome.

5.5 Validating Simulated Results

To validate the simulation, I will attempt to use it to replicate the source data precisely, and I will also reach out to subject matter experts in the relief field for their opinions on the soundness of the methodology and of the results being produced.

If the simulation as described is calibrated with all of the settings found during the analysis, then it should produce results that accord with those seen in the actual disaster. That is, if instead of probabilistically distributing the events seen in the disaster the model is set to make them occur at the precise moments at which they are logged during the analysis, and if the mix of different types of users is set to match that recorded in the data, then the results produced by the model on repeated runs should match those seen during both recorded disasters. This form of validation is relatively weak, as it requires calibrating against the one specific event from which the entire model was derived.
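
A minimal sketch of this replay-style check appears below, assuming a caller-supplied simulate_step(step, event_active) function that returns the category labels emitted in one step, and a Counter of observed label frequencies from the coded corpus; both the function signature and the error metric are illustrative.

```python
from collections import Counter

def replay_error(simulate_step, scheduled_events, observed_counts, n_steps, n_runs=20):
    """Force logged events at their recorded steps (instead of sampling them),
    then compare simulated per-label counts with the observed counts.

    scheduled_events: set of steps at which logged disaster events occurred.
    observed_counts: Counter of category labels in the real, coded corpus.
    Returns the mean absolute per-label count difference, averaged over runs.
    """
    diffs = []
    for _ in range(n_runs):
        simulated = Counter()
        for step in range(n_steps):
            for label in simulate_step(step, step in scheduled_events):
                simulated[label] += 1
        labels = set(observed_counts) | set(simulated)
        diffs.append(sum(abs(observed_counts[l] - simulated[l]) for l in labels)
                     / max(1, len(labels)))
    return sum(diffs) / len(diffs)
```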

To supplement this, I will reach out to the disaster relief community and attempt to contact workers or volunteers affiliated with its technical community to assess their opinions of the methodology underlying the model and of the generated output. If they believe that the principles and results make sense in the context of their work, I will consider that reasonable evidence that the simulation is valid. Correspondingly, it is conceivable that a critique from the community will lead me to rethink the simulation’s features; if doing so increases the simulation’s utility, that effort will be well spent toward the practical end that is this thesis’s stated goal.

6 Limitations

This thesis research has several important limitations. Notably, the data I will be using capture only the “surface” of each disaster, describing tweets with particular superficial features related to that disaster; the tweets come from a sample of just two disasters; and developing a classifier for more than a small number of tweet labels may be difficult.

In the first case, the selectors used to collect the data from Hurricane Sandy and the wildfire season were chosen to acquire tweets from relief services and tweets that most obviously used particular hashtags. While certain users appear frequently in the data, many others show up only once or twice. As such, the data and the corresponding simulation cannot be considered a balanced representation of everything produced by users during disaster. Rather, the data represent only the tweets that users explicitly wanted associated with each disaster, and not any follow-up content. Correspondingly, the analysis and simulation will not represent the totality of tweets produced in disaster, but rather a selective portrait framed by the collected tweets.


The second limitation, that the data comes from only a few sources, limits my ability to generalize results. Hurricane Sandy was an immense disaster that caused widespread devastation; it was not a typical hurricane. Having only one data point decreases my ability to validate the simulation as a good representation of a hurricane. Similarly, the wildfire data, while spanning a whole season, is largely concentrated around only a few fire events. While this provides a somewhat stronger basis for validation, more data would be useful.

The third limitation, while also a product of the small amount of data, stems from the nature of tweets produced during disaster. Although the sampling method for my tweets differs significantly from that used by Vieweg, it is possible that my data, like hers, will contain only small numbers of tweets for some labels. This can be partly addressed by applying effective classifiers to find candidate tweets among the unlabeled data and then manually coding them to increase the size of the data pool; a minimal sketch of this classifier-assisted labeling loop appears below. Nonetheless, successful classification of some labels may prove impossible.
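
The sketch below illustrates one way this could work, assuming a scikit-learn-style classifier exposing fit and predict_proba, feature matrices for the labeled and unlabeled tweets, and a code_by_hand callable standing in for the human coding step; it is an outline of the idea, not the thesis’s final tooling.

```python
def expand_rare_label(clf, labeled_x, labeled_y, unlabeled_x,
                      target_label, code_by_hand, batch_size=50):
    """Surface likely examples of a rare label from the unlabeled pool for manual coding.

    clf: any classifier exposing fit / predict_proba on feature vectors.
    labeled_x, unlabeled_x: feature matrices; labeled_y: existing labels.
    code_by_hand: callable standing in for the human coding step.
    Returns newly coded (example, label) pairs to add to the training data.
    """
    clf.fit(labeled_x, labeled_y)
    label_index = list(clf.classes_).index(target_label)
    scores = clf.predict_proba(unlabeled_x)[:, label_index]
    # Rank the unlabeled tweets by how likely the classifier thinks the rare label is.
    ranked = sorted(range(len(unlabeled_x)), key=lambda i: scores[i], reverse=True)
    return [(unlabeled_x[i], code_by_hand(unlabeled_x[i])) for i in ranked[:batch_size]]
```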

7 Final Results, Contributions and Extensions

After completing each step of the methodology described above, I will be able to characterize the thesis as having made the contributions listed in section 4. To wit: I will have established the feasibility of using a crowdsourcing platform to apply Vieweg-style codings; I will have developed a new classifier that can apply Vieweg-style codings to tweets collected using disaster-related keywords; I will have analyzed and described how two different disasters were represented on Twitter; and I will have created a simulation of the patterns of tweets produced during a hurricane and during a wildfire season.

The principal purpose of this thesis is to help Twitter data be used more effectively in relief situations. In the Contributions section, I posed three research questions. The first of these was “How can we inexpensively and rapidly classify tweets so that relief workers can use them?” The experiment with using both crowdsourcing and machine learning to apply Vieweg’s labels to unclassified data speaks to this point. In her own work, Vieweg argued that particular label codes can be inferred to be of potential interest to particular relief services. By improving the speed and accuracy with which these labels can be applied, I will help move them into practical use during disasters.

The second research question was “How are disasters represented in social media?” Other researchers have attacked this problem, often by collecting tweets produced during disasters and then attempting to describe the observed content qualitatively. My analysis will contribute to this body of work; by applying Vieweg’s labels to all of the tweets collected for these disasters and examining their change over time, I will attempt to offer a new perspective on the question.

My final research question was “How can we usefully reproduce the content produced on social media during disasters?” This question is addressed by my completion of the proposed simulation and by my validating both its ability to reproduce input data and its utility as judged by relief workers. The completed simulation should help relief organizations prepare for disasters by exposing them to the types of tweets that they might expect to see, at the rates at which they might expect to see them.


There are several ways in which this thesis’s contributions might be extended. As described in the Limitations section, even after the work is completed there will be a need to refine the simulation by incorporating additional disaster information. Given the likely imbalance in the numbers of tweets bearing different labels, additional work may also be useful in collecting and coding more examples of the underrepresented categories.

Assuming that crowdsourced application of the different codings to tweets proves effective, both the crowdsourcing workflow and the classifier could be extended into practical tools for processing tweets as a disaster unfolds. The analysis will also contribute to future research that seeks to understand real-world conditions from contributions to social media alone, without recourse to other sources of information.

8 Timeline for Completion

I have divided the work into five main steps: one for each contribution and one for the remaining writing. The steps are split into thirty sub-steps, some of which can be completed in overlapping periods of time. As planned, I estimate that the work will take approximately one year to complete. The thesis document itself will be split into six sections: one for each contribution, an introduction, and a conclusion. An outline and Gantt chart of the process are provided below. Note that the first, second, and third components have a reasonable degree of overlap; analysis of certain tweet features can begin before all the data has been crowdsourced, and attempts to classify the data at a particular tier can begin once that tier of gold-standard data has been created.

1. Refining Tweets to Gold
   a. Creating training materials and software (5 weeks)
   b. Recruiting students and performing the first pass, phase one (4 weeks)
   c. Recruiting crowd workers and performing the first pass, phase one (2 weeks)
   d. First pass, phase two (2 weeks)
   e. Using students and recruiting crowd workers to perform the second pass, phase one (3 weeks)
   f. Second pass, phase two (2 weeks)
   g. Using students and recruiting crowd workers to perform the third pass, phase one (3 weeks)
   h. Third pass, phase two (2 weeks)
   i. Writing a chapter (3 weeks)
2. Developing a Classifier
   a. Applying additional features to tweets (4 weeks)
   b. Classify for situational awareness (4 weeks)
   c. Classify for category (4 weeks)
   d. Classify for particular labels (4 weeks)
   e. Writing a chapter (3 weeks)
3. Analysis
   a. Vieweg’s labels (2 weeks)
   b. Different types of users and their relative activity (3 weeks)
   c. Events (3 weeks)
   d. Hashtags, mentions, retweets, location, date and time (5 weeks)
   e. Writing a chapter (3 weeks)
4. Simulation
   a. Contact external SMEs (3 weeks)
   b. Develop the basic simulation model (5 weeks)
   c. Developing the hurricane simulation (6 weeks)
   d. Developing the wildfire simulation (6 weeks)
   e. Contact the SMEs and validate (4 weeks)
   f. Writing a chapter (3 weeks)
5. Additional writing
   a. Write introduction (2 weeks)
   b. Write conclusion (2 weeks)
   c. Prepare for defense (6 weeks)


9 References

[1] D. Carr, “How Hurricane Sandy Slapped the Sarcasm Out of Twitter,” New York Times: Media Decoder, 31-Oct-2012.
[2] J. Sutton, “Twittering Tennessee: Distributed Networks and Collaboration Following a Technological Disaster,” in Proceedings of the 7th International Conference on Information Systems for Crisis Response and Management, Seattle, Washington, USA, 2010.
[3] L. A. St. Denis, A. L. Hughes, and L. Palen, “Trial by Fire: The Deployment of Trusted Digital Volunteers in the 2011 Shadow Lake Fire,” in Proceedings of the 9th International Conference on Information Systems for Crisis Response and Management, Vancouver, British Columbia, Canada, 2012.
[4] Harvard Humanitarian Initiative, “Disaster Relief 2.0: The future of Information Sharing in Humanitarian Emergencies,” Harvard Humanitarian Initiative, UN Office for the Coordination of Humanitarian Affairs, United Nations Foundation, 2011.
[5] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes Twitter users: real-time event detection by social sensors,” in Proceedings of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA, 2010, pp. 851–860.
[6] S. Asur and B. A. Huberman, “Predicting the Future with Social Media,” presented at the Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, 2010, vol. 1, pp. 492–499.
[7] M. Naaman, J. Boase, and C.-H. Lai, “Is it really about me?: message content in social awareness streams,” in Proceedings of the 2010 ACM conference on Computer supported cooperative work (CSCW), Savannah, Georgia, USA, 2010, pp. 189–192.
[8] S. Verma, S. E. Vieweg, W. J. Corvey, L. Palen, J. H. Martin, M. Palmer, A. Schram, and K. M. Anderson, “Natural Language Processing to the Rescue? Extracting ‘Situational Awareness’ Tweets During Mass Emergency,” in Proceedings of the 2011 International AAAI Conference on Weblogs and Social Media, 2011.
[9] S. Kumar, G. Barbier, M. A. Abbasi, and H. Liu, “TweetTracker: An Analysis Tool for Humanitarian and Disaster Relief,” in Proceedings of the 2011 International AAAI Conference on Weblogs and Social Media, Barcelona, Spain, 2011, pp. 661–662.
[10] M. A. Cameron, R. Power, B. Robinson, and J. Yin, “Emergency situation awareness from twitter for crisis management,” in Proceedings of the 21st international conference companion on World Wide Web, New York, NY, USA, 2012, pp. 695–698.
[11] O. Okolloh, “Ushahidi, or ‘testimony’: Web 2.0 tools for crowdsourcing crisis information,” Particip. Learn. Action, vol. 59, no. 1, pp. 65–70, Jun. 2009.
[12] E. L. Quarantelli, “A Half Century of Social Science Disaster Research: Selected Major Findings and Their Applicability,” University of Delaware, Newark, Delaware, 2003.
[13] M.-A. Abbasi, S. Kumar, J. A. A. Filho, and H. Liu, “Lessons Learned in Using Social Media for Disaster Relief - ASU Crisis Response Game,” presented at the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, College Park, Maryland, 2012.
[14] S. E. Vieweg, “Situational Awareness in Mass Emergency: A Behavioral and Linguistic Analysis of Microblogged Communications,” University of Colorado at Boulder, Boulder, Colorado, USA, 2012.
[15] V. Hester, A. Shaw, and L. Biewald, “Scalable crisis relief: Crowdsourced SMS translation and categorization with Mission 4636,” in Proceedings of the First ACM Symposium on Computing for Development, London, United Kingdom, 2010, pp. 1–7.
[16] P. Earle, M. Guy, R. Buckmaster, C. Ostrum, S. Horvath, and A. Vaughan, “OMG Earthquake! Can Twitter Improve Earthquake Response?,” Seism. Res. Lett., vol. 81, no. 2, pp. 246–251, 2010.
[17] B. De Longueville, R. S. Smith, and G. Luraschi, “‘OMG, from here, I can see the flames!’: a use case of mining location based social networks to acquire spatio-temporal data on forest fires,” in Proceedings of the 2009 International Workshop on Location Based Social Networks, New York, NY, USA, 2009, pp. 73–80.
[18] S. Sato, M. Tatsubori, and F. Imamura, “Mass and social media corpus analysis after the 2011 great east Japan earthquake,” in Proceedings of the 21st international conference companion on World Wide Web, New York, NY, USA, 2012, pp. 711–712.
[19] A. Kongthon, C. Haruechaiyasak, J. Pailai, and S. Kongyoung, “The role of Twitter during a natural disaster: Case study of 2011 Thai Flood,” in Technology Management for Emerging Technologies (PICMET), 2012 Proceedings of PICMET ’12, Vancouver, BC, Canada, 2012, pp. 2227–2232.
[20] S. K. Ploeger, G. M. Atkinson, and C. Samson, “Applying the HAZUS-MH software tool to assess seismic risk in downtown Ottawa, Canada,” Nat. Hazards, vol. 53, no. 1, pp. 1–20, 2010.
[21] F. Morstatter, J. Pfeffer, H. Liu, and K. M. Carley, “Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose,” in Proceedings of the 2013 International AAAI Conference on Weblogs and Social Media, Boston, Massachusetts, USA, 2013.
[22] N. Arcenaux and A. S. Weiss, “Seems stupid until you try it: press coverage of Twitter, 2006-9,” New Media Soc., vol. 12, no. 8, pp. 1262–1279, 2010.
[23] Twitter, “#numbers,” Twitter Blog, 14-Mar-2011.
[24] Semiocast, “Twitter reaches half a billion accounts; More than 140 millions in the U.S.,” Semiocast Publications, 30-Jul-2012.
[25] Semiocast, “Brazil becomes 2nd country on Twitter, Japan 3rd; Netherlands most active country,” Semiocast Publications, 31-Jan-2012.
[26] V. Lipman, “The World’s Most Active Twitter Country? (Hint: Its Citizens Can’t Use Twitter),” Forbes, 01-May-2013.
[27] R. Krikorian, “New tweets per second record, and how!,” Twitter Blog, 16-Aug-2013.
[28] A. Smith and J. Brenner, “Twitter Use 2012,” Pew Research Center, Washington, DC, USA, Pew Internet & American Life Project, May 2012.
[29] A. Java, X. Song, T. Finin, and B. Tseng, “Why we twitter: understanding microblogging usage and communities,” in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, San Jose, California, 2007, pp. 56–65.
[30] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?,” in Proceedings of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA, 2010, pp. 591–600.
[31] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts, “Everyone’s an influencer: quantifying influence on twitter,” in Proceedings of The Fourth ACM International Conference on Web Search and Data Mining, Hong Kong, China, 2011, pp. 65–74.
[32] R. R. Dynes, Organized Behavior in Disaster. Heath, 1970.
[33] D. E. Wenger, J. D. Dykes, T. D. Sebok, and J. L. Neff, “It’s a matter of myths: An empirical examination of individual insight into disaster response,” Mass Emergencies, vol. 1, no. 1, pp. 33–46, 1975.
[34] W. H. Form and S. Nosow, Community in disaster. Harper, 1958.
[35] D. Mendoça, T. Jefferson, and J. Harrald, “Collaborative adhocracies and mix-and-match technologies in emergency management,” Commun ACM, vol. 50, no. 3, pp. 44–49, Mar. 2007.
[36] I. Shklovski, L. Palen, and J. Sutton, “Finding community through information and communication technology in disaster response,” in Proceedings of the 2008 ACM conference on Computer supported cooperative work, San Diego, CA, USA, 2008, pp. 127–136.
[37] I. Shklovski, M. Burke, S. Kiesler, and R. Kraut, “Technology Adoption and Use in the Aftermath of Hurricane Katrina in New Orleans,” Am. Behav. Sci., vol. 53, no. 8, pp. 1228–1246, Feb. 2010.
[38] R. Munro, “Crowdsourcing and the crisis-affected community: Lessons learned and looking forward from Mission 4636,” Inf. Retr., vol. 16, no. 2, pp. 210–266, Apr. 2013.
[39] A. Sarcevic, L. Palen, J. White, K. Starbird, M. Bagdouri, and K. Anderson, “‘Beacons of hope’ in decentralized coordination: learning from on-the-ground medical twitterers during the 2010 Haiti earthquake,” in Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (CSCW), Seattle, Washington, USA, 2012, pp. 47–56.
[40] D. Sullivan, “Tracking Hurricane Sandy News Through Twitter,” Marketing Land, 30-Oct-2012.
[41] M. Mendoza, B. Poblete, and C. Castillo, “Twitter under crisis: can we trust what we RT?,” in Proceedings of the First Workshop on Social Media Analytics (SOMA), Washington D.C., District of Columbia, 2010, pp. 71–79.
[42] J. Pfeffer, T. Zorbach, and K. M. Carley, “Understanding online firestorms: Negative word of mouth dynamics in social media networks,” J. Mark. Commun., Forthcoming 2013.
[43] S. Sinnappan, C. Farrell, and E. Stewart, “Priceless Tweets! A Study on Twitter Messages Posted During Crisis: Black Saturday,” in Proceedings of the 2010 Australasian Conference on Information Systems (ACIS), 2010.
[44] Y. Qu, C. Huang, P. Zhang, and J. Zhang, “Microblogging after a major disaster in China: a case study of the 2010 Yushu earthquake,” in Proceedings of the ACM 2011 conference on Computer supported cooperative work (CSCW), Hangzhou, China, 2011, pp. 25–34.
[45] K. Starbird, L. Palen, A. L. Hughes, and S. E. Vieweg, “Chatter on the red: what hazards threat reveals about the social life of microblogged information,” in Proceedings of the 2010 ACM conference on Computer supported cooperative work, Savannah, Georgia, USA, 2010, pp. 241–250.
[46] K. Starbird, G. Muzny, and L. Palen, “Learning from the Crowd: Collaborative Filtering Techniques for Identifying On-the-Ground Twitterers during Mass Disruptions,” in Proceedings of the 9th International Conference on Information Systems for Crisis Response and Management (ISCRAM), Vancouver, British Columbia, Canada, 2012.
[47] M. A. Finney, “FARSITE: Fire Area Simulator–model development and evaluation,” U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station, Ogden, Utah, Research Paper RMRS-RP-4, Feb. 2004.
[48] M. A. Finney and K. C. Ryan, “Use of the FARSITE fire growth model for fire prediction in US National Parks,” in International emergency management and engineering conference, edited by JD Sullivan, JL Wybo, and L. Buisson. Paris, France: International Emergency Management and Engineering Society, 1995.
[49] E. K. Noonan, “A coupled model approach for assessing fire hazard at point Reyes national seashore: FlamMap and GIS,” in Second international wildland fire ecology and fire management congress and fifth symposium on fire and forest meteorology, Orlando, FL. American Meteorological Society, 2003, pp. 127–128.
[50] P. J. Vickery, P. F. Skerlj, J. Lin, L. A. Twisdale, M. A. Young, and F. M. Lavelle, “HAZUS-MH Hurricane Model Methodology. II: Damage and Loss Estimation,” Nat. Hazards Rev., vol. 7, no. SPECIAL ISSUE: Multihazards Loss Estimation and HAZUS, pp. 94–103, 2006.
[51] S.-C. Chen, M. Chen, N. Zhao, S. Hamid, K. Chatterjee, and M. Armella, “Florida public hurricane loss model: Research in multi-disciplinary system integration assisting government policy making,” Gov. Inf. Q., vol. 26, no. 2, pp. 285–294, 2009.
[52] J. F. Remo and N. Pinter, “Hazus-MH earthquake modeling in the central USA,” Nat. Hazards, vol. 63, no. 2, pp. 1055–1081, 2012.
[53] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters,” Internet Math., vol. 6, no. 1, pp. 29–123, 2009.
[54] D. M. Romero, B. Meeder, and J. Kleinberg, “Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter,” in Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 2011, pp. 695–704.
[55] R. Axelrod, “The Dissemination of Culture: A Model with Local Convergence and Global Polarization,” J. Confl. Resolut., vol. 41, no. 2, pp. 203–226, 1997.
[56] T. C. Schelling, “Models of Segregation,” Am. Econ. Rev., vol. 59, no. 2, pp. 488–493, May 1969.
[57] P. Landwehr, M. Spraragen, B. Ranga, K. M. Carley, and M. Zyda, “Games, Social Simulations, and Data—Integration for Policy Decisions: The SUDAN Game,” Simul. Gaming, Sep. 2012.
[58] C. Schreiber, S. Singh, and K. M. Carley, “Construct - A Multi-agent network model for the co-evolution of agents and socio-cultural environments,” Carnegie Mellon University, Pittsburgh, PA, CASOS Technical Report CMU-ISRI-04-109, May 2004.
[59] E. S. Blake, T. B. Kimberlain, R. J. Berg, J. P. Cangialosi, and J. L. Beven II, “Hurricane Sandy,” National Hurricane Center, Tropical Cyclone Report AL182012, Feb. 2013.
[60] CNN Library, “Hurricane Sandy Fast Facts,” CNN, 13-Jul-2013.
[61] J. Fenster and J. Swift, “Hurricane Sandy turned ‘Frankenstorm’ may be headed for Connecticut next week,” The Middletown Press, 25-Oct-2012.
[62] S. E. Cohen, “Sandy Marked a Shift for Social Media Use in Disasters,” Emergency Management, 07-Mar-2013.
[63] M. Sledge, “Hurricane Sandy Damage in New York By-The-Numbers,” The Huffington Post.
[64] S. Watson, “Hurricane Sandy victims: Recovery slowed by red tape and lack of information,” The Press of Atlantic City, Pleasantville, New Jersey, 15-Aug-2013.
[65] S. Kumar, “sampling methods for wildfire & sandy tweets?,” 29-Aug-2013.
[66] B. Chang, W. Frankenstein, and B. Yang, “#sandy.”
[67] K. Coffman, “Colorado wildfire expands viciously, Obama plans visit,” Reuters, Colorado Springs, Colorado, 27-Jun-2012.
[68] Southern Ute Agency, “Air Park,” InciWeb, 25-Jul-2012.
[69] Bureau of Land Management, “Pine Ridge,” InciWeb, 04-Jul-2012.
[70] Colorado State Forest Service, “Flagstaff Fire,” InciWeb, 30-Jun-2012.
[71] U.S. Forest Service, “Treasure,” InciWeb, 01-Oct-2012.
[72] U.S. Forest Service, “Waldo Canyon Fire,” InciWeb, 01-Oct-2012.
[73] Bureau of Land Management, “Lightner,” InciWeb, 09-Jul-2013.
[74] U.S. Forest Service, “Weber,” InciWeb, 09-Jul-2012.
[75] Bureau of Land Management, “Brush Creek,” InciWeb, 26-Jun-2013.
[76] U.S. Forest Service, “Springer,” InciWeb, 01-Oct-2012.
[77] U.S. Forest Service, “Duckett Fire,” InciWeb, 01-Oct-2012.
[78] U.S. Forest Service, “High Park Fire,” InciWeb, 24-Oct-2012.
[79] U.S. Forest Service, “Button Rock Fire,” InciWeb, 05-Jun-2012.
[80] Uncompahgre Field Office, Bureau of Land Management, “Sunrise Mine,” InciWeb, 07-Jun-2012.
[81] U.S. Forest Service, “Hewlett Fire,” InciWeb, 26-Jul-2012.
[82] U.S. Forest Service, “Little Sand,” InciWeb, 09-Jul-2012.
[83] Colorado State Forest Service, “Lower North Fork,” InciWeb, 02-Apr-2012.
[84] C. Minshew and D. J. Schneider, “2012 Colorado wildfires - at a glance,” Denver Post, Denver, Colorado, United States, 15-Jun-2013.
[85] E. Aramaki, S. Maskawa, and M. Morita, “Twitter catches the flu: detecting influenza epidemics using Twitter,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 2011, pp. 1568–1576.

Appendices

A. Selectors used for the first Hurricane Sandy Dataset

34 keywords:

baltraffic Ctsandy damage dcsandy dctraffic desandy evacuation flood Florida frankenstorm hurricane linedown mdtraffic njsandy njtraffic Nyctraffic nysandy outage outpage power sandy sandyde Shelter storm stormed surge tree treedown tropical Vatraffic water #hamptons #northfork #nofo

72 users:

1PolicePlaza 511NY 511NYC 511PAStatewide AACO_OEM BelAirVolFireCo breakingstorm CapitalAlert ccpa_net_dps911 CecilCountyDES ChesterfieldVa CityofVaBeach CraigatFEMA DarienTimes DC_HSEMA dcfireems DCPoliceDept DDOTDC Delaware delaware_gov

delawareonline DEStormInfo drgridlock fairfaxcounty fairfaxctycert FairfieldSun FEMAregion3 GovernorMarkell HenricoOEM HHCnyc HoCoGov HRD_AOML_NOAA HumanityRoad HurricaneAlerts JSHurricaneNews LoudounFire mayorsCAU MONOCEMS MontgomeryCoMD NEMA_DC NHC_Atlantic NHC_Surge nhregister nj1015 njdotcom NJNewsCommons Njtraffic NJTrafficAlerts NotifyNYC NYC_DOT Nycgov NYCTrafficCheck NYCWater nyvost PGCountyOEM postmetrogirl ReadyArlington RedCrossNCR starledger sussex_pio TheSJTimes thewatchmantwit TotalTrafficNYC tvnooz twc_hurricane usNWSgov VaDOT VEMAWeb wmata WTOP WTOPtraffic wunderground

One geographic bounding box: the rectangular region bounded at the southwest corner by (29.571, -76.289) and at the northeast corner by (40.555, -75.212).

B. Selectors used for the second Hurricane Sandy Dataset

10 keywords:

DC Frankenstorm NOVA NYC PGH Pittsburgh Power Sandy Sandyinphilly Wind

C. Selectors used for the 2012 Colorado Wildfire dataset

9 keywords:

colorado wildfire #cofire #cofires #colorado #wildfire #highparkfire #lowernorthforkfire #pyramidmtnfire #waldocanyonfire #waldofire
