Labeling, Analyzing, and Simulating Tweets Produced in Disasters


Thesis Proposal

Peter Maciunas Landwehr
Computation, Organizations, and Society
Institute for Software Research
School of Computer Science
Carnegie Mellon University
[email protected]
September 30, 2013

Abstract

Relief organizations use social media to stay abreast of victims’ needs during disasters and to help in planning relief efforts. Organizations hire dedicated teams to find posts that contain substantive, actionable information. Simulation and training exercises play important roles in preparing for disaster; tools like HAZUS-MH and FlamMap help to predict the physical damages and costs that will result when hurricanes and wildfires strike. Yet few if any disaster simulations incorporate the ways that a disaster can cause an outpouring of comment and useful information on the web. In disasters such as the 2010 Haiti Earthquake, tweets and SMS messages have been translated and coded using crowdsourcing platforms to divide up labor. Machine learning has been used to parse tweets for keywords and then infer earthquake epicenters from the tweets’ physical positions. Researchers have developed several different coding schemes for tweets to identify different types of useful content. While unskilled coders can apply some schemes, others require nuanced, subjective judgments.

I propose to address the problems posed by the dearth of training simulations and the difficulties of applying salient codings to tweets by developing a novel simulation of social media in disaster, experimenting with using crowdsourcing to apply a set of sophisticated labels, and using the coded data to train a machine learning classifier. The new simulation will output sequences of “pseudo-tweets” that consist of content labels and a collection of common features, such as hashtags and retweet indicators.
It will be validated against longitudinal analyses of tweets produced during Hurricane Sandy and the 2012 Colorado Wildfires, and can be incorporated into future disaster preparation tasks. Prior to analysis, a subset of tweets from each disaster will be coded with a sophisticated labeling scheme developed by Vieweg, using both crowd workers and trained linguistics students, with the results of each group’s coding compared against the other. The gold-standard data will be used to train a machine learning classifier for applying the codings in the future.

Committee:
Dr. Kathleen M. Carley (Chair)
Dr. Jason Hong
Dr. Jürgen Pfeffer
Dr. Sarah Vieweg

1 Introduction: Problem Domain and Research Field

Social media platforms are an established part of society. Their users continue to post content to them even at the immediate onset of a disaster. Relief organizations have begun to see these posts (and the tendencies driving these posting practices) as something that can be leveraged to support their various functions in disasters. During Hurricane Sandy, for example, many people afflicted by the storm posted continuous updates to Twitter [1]. Individuals living in the area afflicted by the 2008 spill of coal ash into the Tennessee River attempted to use Twitter to increase public awareness of the event [2]. Relief groups have correspondingly begun turning to Twitter as a source of additional information about disasters as they transpire [3], [4].

The central difficulty of working with social media is the large volume of noise present on the platform. While a good choice of search terms can help to limit the amount of noise, locating relevant terms remains a difficult problem. One approach is that used by Sakaki et al. [5], who used the distribution of different earthquake-related keywords across Japan to predict the epicenters of earthquakes.
Similarly, Asur and Huberman have used the relative frequencies of mentions of particular movies on Twitter to predict weekend box office performance [6]. Another group of researchers has looked at different ways that tweets can be categorized and grouped on the assumption that they may contain additional useful information. These categorizations range from the moderately interpretive, such as Naaman et al.’s categorization of tweets as “sharing information”, to the very subjective but likely useful, such as Verma et al.’s “situational awareness” label [7], [8]. Similarly, both researchers and the private sector are building tools to help relief groups rapidly filter social media data for relevant information [9]–[11].

While relief groups are working with Twitter data during real disasters in order to understand the situation on the ground, simulation has been understood to be a critical component of preparing for different disasters [12]. Yet despite knowing that social media will play a role in any disaster response, little attention has been given to reproducing the form and content of the relief messages produced on social media. One exception was a live simulation of a disaster carried out at Arizona State University in 2011, in which students were recruited to produce proxy tweets [13]. However, the simulation did not incorporate mechanisms to guarantee that students would produce tweets in a manner resembling the true distribution of tweets in disaster.

In this thesis, I will seek to address the dearth of training simulations for social media by developing a simulation that produces pseudo-tweets, each of which contains no text but includes a set of representative tweet features and a particular mix of category labels. In order to create data that can be used with the simulation, I will categorize a subset of two corpora of tweets using labels developed by Vieweg.
These labeled tweets will be used as training data for a classifier to automatically code all tweets in the corpora that were not labeled in the first phase. Once categorized, all of the labeled tweets will be used as the basis for a general analysis of the two disasters as represented on Twitter. This analysis and the distributions of the different labels will then be used to give the simulation a basis in reality. These labels will then be used as the basis for a face validation of the results.

2 Research Questions and Contributions

This thesis will make four contributions, two secondary and two primary. Three of these contributions (crowdsourcing a sophisticated tweet coding scheme, developing a new classifier for tweets, and conducting a longitudinal analysis of tweets made during disasters) are intended to support the fourth: building a simulation of the tweets produced over the course of two different kinds of disasters. The contributions can each be thought of as addressing one of three research questions:

• How can we inexpensively and rapidly classify tweets so that relief workers can use them?
• How are disasters represented in the content of social media?
• How can we usefully reproduce the content produced on social media during disasters?

With these research questions as a basis, the single thesis statement for this work would be: “In order to aid first responders in dealing with Twitter data gleaned during disaster, I propose to experiment with how to efficiently and meaningfully label tweets, to analyze how tweets produced in disaster relate to the events occurring in the disaster, and to develop a representative simulation of the tweets produced in a disaster situation.”

Below, the four contributions are discussed in more detail.
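As a rough illustration of the pseudo-tweets described in the introduction, the following Python sketch generates text-free records that carry only a set of content labels and two common features. It is a minimal sketch under stated assumptions, not the proposed simulation: the label names, label weights, and feature probabilities are hypothetical placeholders rather than values taken from Vieweg's scheme or from the Sandy and Colorado corpora, and the sampling is deliberately simplistic (independent draws at each step).

```python
import random
from dataclasses import dataclass

# Hypothetical label names and relative frequencies. In the proposed work these
# would be estimated from the labeled tweet corpora; here they are placeholders.
LABEL_WEIGHTS = {"label_a": 0.5, "label_b": 0.3, "label_c": 0.2}


@dataclass(frozen=True)
class PseudoTweet:
    """A tweet stand-in with no text: only labels and a few common features."""
    step: int            # position in the generated sequence
    labels: tuple        # one or more content labels
    has_hashtag: bool    # whether the pseudo-tweet carries a hashtag
    is_retweet: bool     # whether the pseudo-tweet is marked as a retweet


def generate_pseudo_tweets(n, label_weights, p_hashtag=0.3, p_retweet=0.2,
                           max_labels=2, seed=None):
    """Return a list of n pseudo-tweets sampled from the given label weights.

    p_hashtag, p_retweet, and max_labels are assumed parameters for this sketch.
    """
    rng = random.Random(seed)
    names = list(label_weights)
    weights = [label_weights[name] for name in names]
    tweets = []
    for step in range(n):
        # Draw between 1 and max_labels labels, then drop duplicates.
        k = rng.randint(1, max_labels)
        drawn = rng.choices(names, weights=weights, k=k)
        labels = tuple(sorted(set(drawn)))
        tweets.append(PseudoTweet(
            step=step,
            labels=labels,
            has_hashtag=rng.random() < p_hashtag,
            is_retweet=rng.random() < p_retweet,
        ))
    return tweets


if __name__ == "__main__":
    for tweet in generate_pseudo_tweets(5, LABEL_WEIGHTS, seed=1):
        print(tweet)
```

A simulation grounded in the longitudinal analysis would presumably let the label weights and feature probabilities vary over the course of the disaster; this sketch holds them fixed for simplicity.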
2.1 Crowdsourcing a Sophisticated Coding Scheme

When creating her own gold-standard data, Vieweg trained two linguistics graduate students in the nuances of her coding scheme and acted as an arbitrator for their decisions [14]. This process took time and effort, and cannot be applied to live data collected during a real disaster. As Vieweg herself mapped out how tweets with particular codes could contain data of interest to relief groups, it would be useful if her coding could be applied during the course of a disaster as data arrives. Faster labeling would play an important role in getting additional information to authorities sooner, delivering aid more quickly, and reducing the costs of providing relief. In this thesis, I will attempt to use workers on a crowdsourcing platform to apply her coding scheme to tweets. Crowd workers have already played crucial roles coding data in other disasters, notably in the 2010 Haiti Earthquake [15]. The central challenge of this task will be converting Vieweg’s instructions into training that can ensure that crowd workers perform quality coding despite lacking the same background training.

2.2 Developing a New Classifier for Tweets

While successfully crowdsourcing the labeling of tweets will decrease the time and expense of coding, developing a functioning machine learning classifier for applying those same labels would have a more dramatic effect, whether used alone or in tandem with crowdsourcing. Vieweg proposes and tests one theoretical model for a classifier in her thesis, and Verma et al. developed a very effective classifier for “situational awareness”, the top-level category in Vieweg’s hierarchy of labels [8]. A new classifier that can effectively apply Vieweg’s labels and categories to tweets that have been identified as possessing situational awareness will further decrease the cost of processing tweets while also pointing the way forward for a system for on-line coding of tweets during a disaster.
Just as Vieweg’s labels exist in a hierarchy, it is likely that classification will be done using a set of classifiers that label tweets as containing situational awareness information, then with one or more categories, and finally with a mix of category labels. Verma et al. were highly successful at training a situational awareness classifier for disaster data, and I hope to build on their success [8]. Given the number of labels developed by Vieweg and their uneven distribution across different disasters, it is conceivable that it will not be possible to train a classifier for every label, but only for those present in my data.
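The cascade described above could be organized roughly as in the following Python sketch. It is an illustration under stated assumptions rather than the proposed classifier: the category and label names are hypothetical placeholders, not Vieweg's actual codes, and the `keyword_test` helper is a trivial keyword matcher standing in for the trained machine learning classifiers that would be used at each level.

```python
def keyword_test(*words):
    """Return a stand-in 'classifier' that fires if any keyword appears.

    This is a placeholder for a trained model, used only to make the
    sketch runnable.
    """
    def test(text):
        tokens = text.lower().split()
        return any(word in tokens for word in words)
    return test


def classify_hierarchically(text, is_situational, category_tests, label_tests):
    """Apply a three-level cascade to one tweet.

    is_situational: callable deciding the top-level situational awareness label
    category_tests: dict mapping category name -> callable
    label_tests:    dict mapping category name -> {label name -> callable}
    Lower levels run only when the level above them has fired.
    """
    if not is_situational(text):
        return {"situational_awareness": False, "categories": {}}

    categories = {}
    for category, category_test in category_tests.items():
        if category_test(text):
            # Only labels that belong to a matched category are considered.
            categories[category] = [
                label
                for label, label_test in label_tests.get(category, {}).items()
                if label_test(text)
            ]
    return {"situational_awareness": True, "categories": categories}


if __name__ == "__main__":
    # Hypothetical category and label names, for illustration only.
    result = classify_hierarchically(
        "road closed near the fire",
        is_situational=keyword_test("fire", "flood"),
        category_tests={"category_x": keyword_test("road")},
        label_tests={"category_x": {"label_closure": keyword_test("closed"),
                                    "label_damage": keyword_test("damaged")}},
    )
    print(result)
```

One consequence of this design is that a label with too few training examples can simply be left out of `label_tests` without affecting the levels above it, which matches the expectation that classifiers may only be trainable for labels present in the data.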