Identifying police killings from news text with distant supervision

Brendan O'Connor
College of Information and Computer Sciences, University of Massachusetts Amherst
http://brenocon.com

Talk at NYU Center for Data Science, 4/27/17

Joint work with: Katherine Keith, Abram Handler, Michael Pinkham, Cara Magliozzi, Joshua McDuffie

Computational Social Science

Official social data and semi-structured social data (timeline figure spanning 100 BCE to 2000):

- Digitized behavior: billions of users, messages/day (data collection, data analysis, data computation)
- Digitized news: thousands of articles/day
- Digitized archives: millions of books/century

TextGenerator(SocialAttributes) → Text

Data generation process: Society (SocialAttributes) → Writing (TextGenerator) → Text Data (Text)

1. Infer: attributes of society (language for social measurement), e.g. opinion, communities, events...
   P(SocialAttributes | Text, TextGenerator)

2. Infer: social determinants of language use from text, e.g. bias, influence...
   P(TextGenerator | Text, SocialAttributes)

Example applications:
- Public opinion measured from social media usage [O'Connor et al., ICWSM 2010]
- Real-world political events measured through the news media process [O'Connor, Stewart, Smith, ACL 2013]
  (Figure: Israeli-Palestinian diplomacy frame over 1994-2007, with labeled events A: Israel-Jordan Peace Treaty, B: Hebron Protocol, C: U.S. Calls for West Bank Withdrawal, D: Deadlines for Wye River Peace Accord, E: Negotiations in Mecca, F: Annapolis Conference; top verb paths include "meet with", "sign with", "praise", "arrive in", "host", "tell", "welcome", "thank", "criticize", ...)

What to analyze: social phenomena in textual datasets
- Dialects in social media
- Events in international relations
- Identifying police killings in news

How to analyze: NLP capabilities we need to do better
- Part of speech tagging
- Syntactic and semantic parsing
- Event extraction

Police killings

- Eric Garner, New York NY (July 17, 2014)
- Michael Brown, Ferguson MO (Aug 9, 2014)
- Alton Sterling, Baton Rouge LA (July 5, 2016)
- Philando Castile, Falcon Heights MN (July 6, 2016)

Data?

- Are there more or fewer fatalities than last year?
- Is there racial disparity / discrimination?
- Which police departments are better or worse? What policing strategies are most effective or safe?
- We need good data for the public interest and for social science / policy making.

Issues in government data

- Washington Post, Oct. 16, 2016: "Americans actually have no idea" about how often police use force, because nobody has collected enough data.
- Unreliable, partial compliance between local agencies and the federal government
- Massively undercounts deaths [Banks et al. 2015 (BJS/DOJ); Lum and Ball 2015 (HRDAG, external)]
- Uncertain future for DOJ programs?
- Compare: the National Justice Database's intended voluntary-participation approach (Center for Policing Equity, John Jay College)

(Embedded report: Banks, Couzens, and Planty, "Assessment of Coverage in the Arrest-Related Deaths Program," Bureau of Justice Statistics Technical Report NCJ 249099, October 2015. For 2003-2009 and 2011, of an estimated 7,427 law enforcement homicides, the BJS Arrest-Related Deaths (ARD) program captured about 49% (3,620) and the FBI Supplementary Homicide Reports (SHR) about 46% (3,385); 72% (5,324) were reported to at least one system, and 28% were captured by neither.)

Alternative: news media reports

- Populate a database by manually reading news articles (filtered by keyword search)
- FatalEncounters.org, KilledByPolice.net, Washington Post, ...
- Fatal Encounters: volunteers have read 2 million articles or ledes (!)
- Augment with open records requests
- BJS, Dec. 2016: media reports double the count compared to previous government collection efforts

Computational approach

- Goal: extract fatality records from a news corpus
  - Off-the-shelf event extractors work poorly (ACE, FrameNet training/ontologies)
  - Instead, train models for this problem (distant supervision + EM)
- NLP and social analysis
  - Concrete, real-world tasks are a useful testbed for NLP research
  - Can NLP offer something useful for important tasks?
  - Public data and government accountability

Task: Database Population

From a time-delimited news corpus, infer the names of persons killed by police during that timeframe (e.g. Alton Sterling, Baton Rouge LA; Philando Castile, Falcon Heights MN).

Task: Database Update

- Historical data (e.g. Eric Garner, New York NY; Michael Brown, Ferguson MO) provides distant supervision for training.
- At testing / runtime, extract newly reported victims (e.g. Alton Sterling, Baton Rouge LA; Philando Castile, Falcon Heights MN) from a later corpus.

Mentions: predict whether each one describes a police fatality

- "The Baton Rouge Police Department confirms that Alton Sterling, 37, died during a shooting at the Triple S Food Mart" → 0.4
- "... the two officers involved in Tuesday's shooting of Alton Sterling ..." → 0.8
- "... Alton Sterling was a resident of Baton Rouge ..." → 0.01

Aggregate: add to database? → Entity-level fatality record (name): Alton Sterling
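To make the aggregation step concrete, here is a minimal sketch (not the system's actual code) of combining per-mention probabilities into an entity-level decision with a noisy-or; `mention_probs` stands in for the mention classifier's outputs.

```python
def noisy_or(mention_probs):
    """Probability the entity was killed by police, assuming the entity-level
    label is a disjunction over independent mention-level assertions."""
    p_none = 1.0
    for p in mention_probs:
        p_none *= (1.0 - p)          # probability that no mention asserts the fatality
    return 1.0 - p_none

# Toy example from the slide: three mentions of "Alton Sterling"
print(noisy_or([0.4, 0.8, 0.01]))    # ~0.88 -> strong evidence; add to database
```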

Pipeline (incl. training)

- Sentence pipeline: news reports, Jan-Dec 2016; ~1.1 million sentences containing person names and fatality keywords
- Distant labeling: label each sentence 1/0 by matching against the Fatal Encounters database (sketched below)
- Feature extraction: shallow semantic parsing, word embeddings
- Classifiers: logistic regression, CNN
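A minimal sketch of the "hard" distant-labeling idea, assuming we have gold victim names and incident locations scraped from Fatal Encounters; the `gold` dictionary and the simple substring matching below are illustrative simplifications, not the pipeline's exact matching rules.

```python
# Hypothetical gold records scraped from Fatal Encounters: name -> incident location
gold = {"Alton Sterling": "Baton Rouge", "Philando Castile": "Falcon Heights"}

def distant_label(sentence, person_name, require_location=False):
    """Impute a training label for a (sentence, candidate name) pair.
    name-only rule: label 1 if the candidate name is a gold victim.
    name-and-location rule: additionally require the gold incident location in the sentence."""
    if person_name not in gold:
        return 0
    if require_location and gold[person_name] not in sentence:
        return 0
    return 1

sent = "The Baton Rouge Police Department confirms that Alton Sterling, 37, died during a shooting."
print(distant_label(sent, "Alton Sterling", require_location=True))  # 1
print(distant_label(sent, "Andrew Hanson"))                          # 0
```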

Data

Knowledge base (Fatal Encounters):
  Historical: incident dates Jan 2000 - Aug 2016; 17,219 gold entities
  Test:       incident dates Sep 2016 - Dec 2016;    452 gold entities

News dataset (keyword-querying web scraper running throughout 2016; preprocessing: text extraction, deduplication via shingling/union-find as sketched below, spaCy NER + parsing, name cleanups):
  Train: document dates Jan 2016 - Aug 2016; 793,010 docs; 132,833 mentions (11,274 positive); 49,203 entities (916 positive)
  Test:  document dates Sep 2016 - Dec 2016; 317,345 docs;  68,925 mentions (6,132 positive); 24,550 entities (258 positive)

(Mentions result from NER processing; positive mentions and entities come from matching textual named entities against the gold-standard database. The slide embeds a page of the accompanying paper, which formalizes the task: from a corpus of documents D, extract a list of candidate person names E, and for each e in E predict P(y_e = 1 | x_M(e)), i.e. whether person e was killed by police.)
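The preprocessing above mentions deduplication by shingling and union-find; here is a small illustrative sketch of that idea (character 5-gram shingles, Jaccard similarity, union-find clustering). The 0.8 threshold is an assumption, not necessarily the one used in the actual pipeline.

```python
def shingles(text, k=5):
    """Set of character k-grams ("shingles") for near-duplicate detection."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def find(parent, i):                      # union-find root lookup with path halving
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def dedup(docs, threshold=0.8):
    """Cluster near-duplicate documents; keep one representative per cluster."""
    sh = [shingles(d) for d in docs]
    parent = list(range(len(docs)))
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sh[i], sh[j]) >= threshold:
                parent[find(parent, j)] = find(parent, i)   # union the two clusters
    reps = {}
    for i in range(len(docs)):
        reps.setdefault(find(parent, i), docs[i])
    return list(reps.values())

docs = ["Officers shot and killed a man on Tuesday.",
        "Officers shot and killed a man on Tuesday .",   # near-duplicate wire copy
        "A completely different local news story."]
print(len(dedup(docs)))  # 2
```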

Can NLP help?

A long lineage in Computational Linguistics and Artificial Intelligence:
- Theory: Case Grammar (Fillmore 1968, "The Case for Case"); Frames (Schank and Abelson 1977, "Scripts, Plans, Goals, and Understanding")
- Datasets: FrameNet, MUC, VerbNet, ACE, PropBank, GENIA (bio), OntoNotes, CAMEO (polisci)
- Task names: Semantic Role Labeling, Semantic Parsing, Information Extraction, Event Extraction

Can NLP help? Off-the-shelf event extractors

- Evaluate two off-the-shelf event extractors:
  - SEMAFOR, trained for FrameNet [Das et al. 2014]
  - RPI Joint Information Extraction system, trained for ACE [Li and Ji 2014]
  - Previously found useful for gun violence extraction [Pavlick and Callison-Burch 2016]
- Classify an entity as killed by police if, for at least one of its mentions, the extractor says:
  - (R1) a killing event took place,
  - (R2) and its patient is the mentioned person under consideration,
  - (R3) and its agent is described as police.
  (SEMAFOR: FrameNet 'Killing' frame with 'Victim' and 'Killer' frame elements; RPI-JIE: ACE 'life/die' event type/subtype with 'victim' and 'agent' roles. A minimal rule sketch follows the tables below.)

Rule                      Prec.   Recall  F1
SEMAFOR R1                0.011   0.436   0.022
SEMAFOR R2                0.031   0.162   0.051
SEMAFOR R3                0.098   0.009   0.016
RPI-JIE R1                0.016   0.447   0.030
RPI-JIE R2                0.044   0.327   0.078
RPI-JIE R3                0.172   0.168   0.170
Data upper bound (§4.6)   1.0     0.57    0.73

(Precision, recall, and F1 on test data for SEMAFOR and RPI-JIE under rules R1-R3.)

Hard problem!
- RPI-JIE under the full R3 rule performs best, but all results are relatively poor.
- Domain adaptation? Text cleanliness? Training data weirdness? These systems rely heavily on their annotated training sets and likely lose accuracy on new domains and on messy text extracted from web news, suggesting domain transfer as future work.

For comparison, the task-specific models described next (entity-level test set; AUPRC / max F1):
hard-LR, dep. feats.      0.117 / 0.229
hard-LR, n-gram feats.    0.134 / 0.257
hard-LR, all feats.       0.142 / 0.266
hard-CNN                  0.130 / 0.252
soft-CNN (EM)             0.164 / 0.267
soft-LR (EM)              0.193 / 0.316
Data upper bound (§4.6)   0.57  / 0.73
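As referenced above, here is a minimal sketch of the R1-R3 rule cascade applied to extractor output. The `EventTuple` structure and the `POLICE_KEYWORDS` list are illustrative stand-ins for whatever SEMAFOR or RPI-JIE actually return, not their real APIs.

```python
from dataclasses import dataclass

POLICE_KEYWORDS = {"police", "officer", "officers", "cop", "deputy", "trooper"}  # illustrative list

@dataclass
class EventTuple:          # hypothetical normalized extractor output
    event_type: str        # e.g. "kill"
    agent_span: str        # text of the agent argument
    patient_span: str      # text of the patient/victim argument

def classify_mention(tuples, person_name, rule="R3"):
    """Return 1 if any extracted event satisfies the chosen rule for this mention."""
    for t in tuples:
        r1 = (t.event_type == "kill")
        r2 = r1 and person_name in t.patient_span
        r3 = r2 and any(w in t.agent_span.lower() for w in POLICE_KEYWORDS)
        if {"R1": r1, "R2": r2, "R3": r3}[rule]:
            return 1
    return 0

tuples = [EventTuple("kill", "Baton Rouge police officers", "Alton Sterling , 37")]
print(classify_mention(tuples, "Alton Sterling", rule="R3"))  # 1
```

Entity-level predictions then take a deterministic OR over these mention-level decisions, as in the aggregation step shown earlier.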

Model

- (1) Identify sentence-level fatality assertions: does this sentence describe a police killing?
- (2) Aggregate to entity (person)-level predictions: was this person killed by police?
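A minimal sketch of step (1) as a bag-of-n-grams logistic regression over the mention's sentence, using scikit-learn; these toy features are far simpler than the dependency-path and embedding features used in the actual system, and the tiny training set here is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy distantly-labeled mentions (candidate name masked as TARGET)
train_sents = [
    "police shot and killed TARGET on Tuesday",            # label 1
    "officers fatally shot TARGET during a traffic stop",  # label 1
    "TARGET was a resident of Baton Rouge",                 # label 0
    "TARGET spoke at a protest about police violence",      # label 0
]
train_labels = [1, 1, 0, 0]

mention_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
    LogisticRegression(),
)
mention_clf.fit(train_sents, train_labels)

# P(z_i = 1 | x_i) for a new mention
print(mention_clf.predict_proba(["police officers shot and killed TARGET"])[0, 1])
```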

Thursday, April 27, 17 Identifying civilians killed by police with distantly supervised entity-event Modelextraction Katherinegistic Keith, Abram model: Handler, Michael Pinkham, This is the direct unary predicate analogue of Cara Magliozzi, Joshua McDuffie, and Brendan O’Connor College of Information• (1) andIdentify Computer sentence-level Sciences fatality assertions Mintz et al. (2009)’s distant supervision assump- University of Massachusetts Amherst T [email protected],P [email protected](zi =1 xi)=( f(xi)). (2) tion, which assumes every mention of a gold- http://slanglab.cs.umass.edu/| describes (Draft, April 15 2017; currently under review) e.g. logistic regression, positive entity exhibits a description of a police police killing sentence convolutional neural network We experimentevent? with both logistic regression ( 4.4) killing. Abstract Text Person killed § and convolutional neural networksby police? ( 4.5) for this This assumption is not correct. We manually We propose a new, socially-impactful task Alton Sterling was killed by police. True § for natural language processing:component. from a Officers Then shot and killed wePhilando must Castile. somehow True aggregate news corpus, extract names of persons Officer Andrew Hanson was shot. False analyze a sample of positive mentions and find 36 who have been killed bymention-level police. We Police decisions report Megan Short was to fatally determine shot False entity labels in apparent murder-suicide. out of 100 name-only sentences did not express a present a newly collected police8 fatal- ity corpus, which we releaseye. publicly,If a human reader were to observe at least one (2) AggregateTable 1: Toy examples to entity (with entities (person)-level in bold) illus- predictions police fatality event—for example, sentences con- and present a model to solve this prob-• lem that uses EM-based distantsentence supervi- thattrating states the problem a ofperson extracting from was text killed names by police, tain commentary, or describe killings not by po- sion with logistic regression and convo- of persons who have been killed by police. lutional neural network classifiers.they would Our infer that person was killed by police. lice. This is similar to the precision for distant su- model outperforms two off-the-shelfTherefore event weDec. 2016) aggregate which augmented an traditional entity’s police- mention-level extractor systems, and identifies the names maintained records with media24 reports, finding pervision of binary relations found by Riedel et al. of 39 victims missed by one of the twice as many deaths compared to past govern- Thursday,labels April 27, 17 with a deterministic disjunction: major manually-collected police fatality ment analyses. This suggests textual news data has (2010), who reported 10–38% of sentences did not databases. enormous, real value, though manual news analy- sis remains extremely laborious. express the relation in question. 1 Introduction P (ye =1We proposez to(e help))=1 automate this processi by( ex-e) zi . (3) The United States government does not keep sys- tracting| theM names of persons killed_ by police2M from Our higher precision rule, name-and-location, tematic records of when police kill civilians, de- event descriptions in news articles (Table 1). This spite a clear need for this information to serve the can be formulated as either of two cross-document leverages the fact that the location of the fatality is At test time,entity-eventz is extraction latent. 
tasks: Therefore the correct public interest and support social scientific anal- i also in the Fatal Encounters database and requires ysis. Federal records rely on incompleteinference cooper- for1. Populating an entity an entity-event is to database: marginalize From a out the ation from local police departments, and human corpus of news articles (test) over timespan D both to be present: rights statisticians assess theymodel’s fail to document uncertaintyT , extract the over names ofz personsi: killed by po- thousands of fatalities (Lum and Ball, 2015). lice during that same timespan ( (pred)). E News articles have emerged as a valuable al- (train) ternative data source. Organizations including 2. Updating an entity-event database: In addi- zi =1if e : The Guardian, The Washington PostP, Mapping(y Po-=1xtion to ()=1test), assume accessP to(y both a=0 histor- x ) e (eD) (train) e (e) (4) 9 2 G (8) lice Violence, and Fatal Encounters have started ical database of killings and a histor- | M E | M to build such databases of U.S. police killings ical news corpus (train) for events that oc- name(i)=name(e) and location(e) xi. D ~ by manually reading millions of news articles=11 curredP (z before(Te.) This= setting0 oftenx occurs(e) in) (5) 2 and extracting victim names and event details. practice,M and is the focus of this| paper;M it al- This approach was recently validated by a Bu-=1 lows for the use(1 of distantlyP supervised(z =1 learn- x )). (6) We use this rule for training since precision is reau of Justice Statistics study (Banks et al., ing methods. i | i 1Fatal Encounters director D. Brian Burghart estimates he From thei NLP( researche) perspective, police fatali- slightly better, although there is still a consider- and colleagues have read 2 million news headlines and ledes ties are2 a promisingYM testbed for research: while the to assemble its fatality records that date back to January, 2000 able level of noise. (pers. comm.); we find FE to be the most comprehensive pub- semantics of events and their coreference suffer licly available database. Eq. 6 is thefromnoisyor complex philosophicalformula issues (Pearl (Hovy et, al.1988, ; Craven and Kumlien, 1999). Procedurally, it counts strong 4.3 “Soft” (EM) joint training probabilistic predictions as evidence, but can also At training time, the distant supervision assump- incorporate a large number of weaker signals as tion used in “hard” label training is flawed: many positive evidence as well.9 positively-labeled mentions are in sentences that In order to train these classifiers, we need do not assert the person was killed by a police of- mention-level labels (zi) which we impute via ficer. Alternatively, at training time we can treat two different distant supervision labeling meth- zi as a latent variable and assume, as our model ods: “hard” and “soft.” states, that at least one of the mentions asserts the fatality event, but leave uncertainty over which 4.2 “Hard” distant label training mention (or multiple mentions) conveys this in- In “hard” distant labeling, labels for mentions in formation. This corresponds to multiple instance the training data are heuristically imputed and di- learning (MIL; Dietterich et al. (1997)) which has rectly used for training. We use two labeling rules. been applied to distantly supervised relation ex- First, name-only: traction by enforcing the at least one constraint at training time (Bunescu and Mooney, 2007; Riedel z =1if e (train) : name(i)=name(e). 
et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Ritter et al., 2013). Our approach differs by using exact marginal posterior inference for the E-step.

Thursday, April 27, 17

Model

Identifying civilians killed by police with distantly supervised entity-event extraction
Katherine Keith, Abram Handler, Michael Pinkham, Cara Magliozzi, Joshua McDuffie, and Brendan O'Connor
College of Information and Computer Sciences, University of Massachusetts Amherst
http://slanglab.cs.umass.edu/
(Draft, April 15 2017; currently under review)

(1) Identify sentence-level fatality assertions: does the sentence describe a police killing event? (e.g. logistic regression, convolutional neural network)

  "Alton Sterling was killed by police."                                     True
  "Officers shot and killed Philando Castile."                               True
  "Officer Andrew Hanson was shot."                                          False
  "Police report Megan Short was fatally shot in apparent murder-suicide."   False

Table 1: Toy examples (with entities in bold) illustrating the problem of extracting from text the names of persons who have been killed by police.

(2) Aggregate to entity (person)-level predictions: was person e killed by police? (over all sentences mentioning person e)
24

Abstract
We propose a new, socially-impactful task for natural language processing: from a news corpus, extract names of persons who have been killed by police. We present a newly collected police fatality corpus, which we release publicly, and present a model to solve this problem that uses EM-based distant supervision with logistic regression and convolutional neural network classifiers. Our model outperforms two off-the-shelf event extractor systems, and identifies the names of 39 victims missed by one of the major manually-collected police fatality databases.

1 Introduction
The United States government does not keep systematic records of when police kill civilians, despite a clear need for this information to serve the strong public interest and support social scientific analysis. Federal records rely on incomplete cooperation from local police departments, and human rights statisticians assess that they fail to document thousands of fatalities (Lum and Ball, 2015).

News articles have emerged as a valuable alternative data source. Organizations including The Guardian, The Washington Post, Mapping Police Violence, and Fatal Encounters have started to build such databases of U.S. police killings by manually reading millions of news articles and extracting victim names and event details.[1] This approach was recently validated by a Bureau of Justice Statistics study (Banks et al., Dec. 2016), which augmented traditional police-maintained records with media reports, finding twice as many deaths compared to past government analyses. This suggests textual news data has enormous, real value, though manual news analysis remains extremely laborious.

We propose to help automate this process by extracting the names of persons killed by police from event descriptions in news articles (Table 1). This can be formulated as either of two cross-document entity-event extraction tasks:
1. Populating an entity-event database: From a corpus of news articles over timespan T, extract the names of persons killed by police during that same timespan (E(pred)).
2. Updating an entity-event database: In addition, assume access to both a historical database of killings E(train) and a historical news corpus D(train) for events that occurred before T. This setting often occurs in practice, and is the focus of this paper; it allows for the use of distantly supervised learning methods.

From the NLP research perspective, police fatalities are a promising testbed for research: the semantics of events and their coreference suffer from complex philosophical issues (Hovy et al., ...).

[1] Fatal Encounters director D. Brian Burghart estimates that he and colleagues have read 2 million news headlines and ledes to assemble its fatality records, which date back to January 2000 (pers. comm.); we find FE to be the most comprehensive publicly available database.

Each mention-level decision is modeled with a logistic model:
    P(z_i = 1 | x_i) = σ(θᵀ f(x_i)).    (2)
We experiment with both logistic regression (§4.4) and convolutional neural networks (§4.5) for this component. Then we must somehow aggregate mention-level decisions to determine entity labels y_e.[8] If a human reader were to observe at least one sentence that states a person was killed by police, they would infer that person was killed by police. Therefore we aggregate an entity's mention-level labels with a deterministic disjunction:
    P(y_e = 1 | z_M(e)) = 1{ ∨_{i ∈ M(e)} z_i }.    (3)
At test time, z_i is latent. Therefore the correct inference for an entity is to marginalize out the model's uncertainty over z_i:
    P(y_e = 1 | x_M(e)) = 1 − P(y_e = 0 | x_M(e))    (4)
                        = 1 − P(z_M(e) = 0 | x_M(e))    (5)
                        = 1 − ∏_{i ∈ M(e)} (1 − P(z_i = 1 | x_i)).    (6)
Eq. 6 is the noisyor formula (Pearl, 1988; Craven and Kumlien, 1999). Procedurally, it counts strong probabilistic predictions as evidence, but can also incorporate a large number of weaker signals as positive evidence.[9]

[8] An alternative approach is to aggregate features across mentions into an entity-level feature vector (Mintz et al., 2009; Riedel et al., 2010); but here we opt to directly model at the mention level, which can use contextual information.
[9] In early experiments, we experimented with other, more ad-hoc aggregation rules with a "hard"-trained model. The maximum and arithmetic mean functions performed worse than noisyor, giving credence to the disjunction model. The sum rule (Σ_i P(z_i = 1 | x_i)) had similar ranking performance as noisyor, perhaps because it too can use weak signals, unlike mean or max, though it does not yield proper probabilities between 0 and 1.

In order to train these classifiers, we need mention-level labels (z_i), which we impute via two different distant supervision labeling methods: "hard" and "soft."

4.2 "Hard" distant label training
In "hard" distant labeling, labels for mentions in the training data are heuristically imputed and directly used for training. We use two labeling rules. First, name-only:
    z_i = 1 if ∃ e ∈ G(train) : name(i) = name(e).    (7)
This is the direct unary predicate analogue of Mintz et al. (2009)'s distant supervision assumption, which assumes every mention of a gold-positive entity exhibits a description of a police killing.
This assumption is not correct. We manually analyze a sample of positive mentions and find 36 out of 100 name-only sentences did not express a police fatality event; for example, sentences contain commentary, or describe killings not by police. This is similar to the precision for distant supervision of binary relations found by Riedel et al. (2010), who reported 10–38% of sentences did not express the relation in question.
Our higher-precision rule, name-and-location, leverages the fact that the location of the fatality is also in the Fatal Encounters database, and requires both to be present:
    z_i = 1 if ∃ e ∈ G(train) : name(i) = name(e) and location(e) ∈ x_i.    (8)
We use this rule for training since its precision is slightly better, although there is still a considerable level of noise.

4.3 "Soft" (EM) joint training
At training time, the distant supervision assumption used in "hard" label training is flawed: many positively-labeled mentions are in sentences that do not assert the person was killed by a police officer. Alternatively, at training time we can treat z_i as a latent variable and assume, as our model states, that at least one of the mentions asserts the fatality event, but leave uncertainty over which mention (or multiple mentions) conveys this information. This corresponds to multiple instance learning (MIL; Dietterich et al., 1997), which has been applied to distantly supervised relation extraction by enforcing the at-least-one constraint at training time (Bunescu and Mooney, 2007; Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Ritter et al., 2013). Our approach differs by using exact marginal posterior inference for the E-step.

4.3.1 EM training for the latent disjunction model
With z_i as latent, the model can be trained with the EM algorithm (Dempster et al., 1977). We initialize the model by training on the "hard" distant labels (§4.2), and then learn improved parameters by alternating E- and M-steps.
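To make the "hard" labeling rules of §4.2 (Eqs. 7–8) concrete, here is a minimal Python sketch; the data structures and field names (mentions with a "name" and "sentence", gold entities with a "name" and "location") are illustrative assumptions rather than the paper's actual code.

```python
def impute_hard_labels(mentions, gold_entities, rule="name-and-location"):
    """Heuristically impute mention-level labels z_i from the training-period
    fatality database (Eq. 7: name-only; Eq. 8: name-and-location).

    mentions: list of dicts with "name" (candidate person name) and "sentence".
    gold_entities: list of dicts with "name" and "location" from the database.
    Returns a list of 0/1 labels, one per mention.
    """
    labels = []
    for m in mentions:
        z = 0
        for e in gold_entities:
            if m["name"] != e["name"]:
                continue
            if rule == "name-only":                    # Eq. 7
                z = 1
            elif e["location"] in m["sentence"]:       # Eq. 8
                z = 1
            if z:
                break
        labels.append(z)
    return labels
```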
Thursday, April 27, 17

Model
(Each sentence x_i mentioning the person feeds a latent sentence label z_i; an OR over the z_i gives the entity label y: was person e killed by police?)
• Prediction through disjunction: decide an entity was killed by police if at least one of their sentences asserts they were killed by police
• Integrate over the x → z uncertainty: noisyor [e.g. Craven and Kumlien 1999]
25
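A small sketch of the noisyor aggregation (Eqs. 4–6) used for entity-level prediction; the helper names and the parallel-list representation of mentions are assumptions for illustration.

```python
from collections import defaultdict

def noisyor(mention_probs):
    """Entity-level P(y_e = 1 | x) from mention-level P(z_i = 1 | x_i) (Eq. 6)."""
    p_all_zero = 1.0
    for p in mention_probs:
        p_all_zero *= (1.0 - p)
    return 1.0 - p_all_zero

def entity_predictions(mention_entity_ids, mention_probs):
    """Group mention probabilities by entity and aggregate with noisyor."""
    by_entity = defaultdict(list)
    for ent, p in zip(mention_entity_ids, mention_probs):
        by_entity[ent].append(p)
    return {ent: noisyor(ps) for ent, ps in by_entity.items()}

# Two weak signals and one strong one still yield a confident entity score:
# noisyor([0.2, 0.3, 0.9]) == 1 - 0.8 * 0.7 * 0.1 == 0.944
```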
Thursday, April 27, 17

Mention-level models
Does the sentence describe a police killing event?
1. Feature-engineered logistic regression
   • Syntactic dependency paths (anchored on the TARGET mention)
   • N-grams
2. Convolutional neural network [e.g. Nguyen and Grishman 2015]

[Embedded figure from Kim (2014), "Figure 1: Model architecture with two channels for an example sentence" (the example sentence is "wait for the video and do n't rent it"): convolutional filters of several widths over word embeddings, max-over-time pooling, and dropout regularization on the penultimate layer feeding the final classification layer.]
26
Thursday, April 27, 17

Distant supervision

entities                          entity label   sentences                                          sent. label
e not in database: Katy Perry     0              "Katy Perry reacts to police killings."            0   (enforce hard 0 label)
e in database: Alton Sterling     1              "Alton Sterling was killed by police."             ?   (assume at least one is
                                                 "Alton Sterling was a resident of Baton Rouge."    ?    positive: a latent variable!)

• Multiple instance learning [Bunescu and Mooney 2007]
• Much more accurate than assuming every sentence asserts the event!
• Probabilistic joint training: account for this uncertainty by maximizing the marginal likelihood
      P(y | x) = Σ_z P(y | z) P_θ(z | x)
27
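A sketch of how the entity-level distant labels on this slide might be derived by matching candidate names against the fatality database; exact lowercase name matching is a simplification assumed here for illustration, not the paper's exact matching procedure.

```python
def entity_distant_labels(candidate_names, knowledge_base_names):
    """Entities found in the training-period fatality database get y_e = 1,
    all other candidate entities get a hard 0 label."""
    kb = {name.lower() for name in knowledge_base_names}
    return {name: int(name.lower() in kb) for name in candidate_names}
```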

Thursday, April 27, 17

EM Training [Dempster et al. 1977]
• E-step: posterior inference, given the positive disjunction
• M-step: retrain on the resulting soft labels
   • Logistic regression: full M-step (convex optimization, L-BFGS)
   • Neural network: several epochs of stochastic gradient descent (Adagrad)
• Similar to: Expected Conjugate Gradient [Salakhutdinov et al. 2003]
• Staged initialization (logistic regression training is nonrandom :) )
28

The E-step requires calculating the marginal posterior probability for each z_i,
    q(z_i) := P(z_i | x_M(e_i), y_{e_i}).    (9)
This corresponds to calculating the posterior probability of a disjunct, given knowledge of the output of the disjunction and the prior probabilities of all disjuncts (given by the mention-level classifier). Since P(z | x, y) = P(z, y | x) / P(y | x),
    q(z_i = 1) = P(z_i = 1, y_{e_i} = 1 | x_M(e_i)) / P(y_{e_i} = 1 | x_M(e_i)).    (10)
The numerator simplifies to the mention prediction P(z_i = 1 | x_i), and the denominator is the entity-level noisyor probability (Eq. 6). This has the effect of taking the classifier's predicted probability and increasing it slightly (since Eq. 10's denominator is no greater than 1); thus the disjunction constraint implies a soft positive labeling. In the case of a negative entity with y_e = 0, the disjunction constraint implies that all z_M(e) stay clamped to 0, as in the "hard" label training method.
The q(z_i) posterior weights are then used for the M-step's expected log-likelihood objective:
    max_θ Σ_i Σ_{z ∈ {0,1}} q(z_i = z) log P_θ(z_i = z | x_i).    (11)
This objective (plus regularization) is maximized with gradient ascent as before.
This approach can be applied to any mention-level probabilistic model; we explore two in the next sections.
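A minimal sketch of one EM iteration for the latent disjunction model (Eqs. 9–11), with a weighted logistic-regression M-step. The helper names, the dense feature matrix, and the use of scikit-learn's sample_weight (each mention listed once per label, weighted by q) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def e_step(p_mention, entity_ids, entity_labels):
    """q(z_i = 1) per Eq. 10: clamped to 0 for negative entities; otherwise the
    mention probability divided by the entity's noisyor probability."""
    q = np.zeros_like(p_mention)
    for ent in np.unique(entity_ids):
        idx = entity_ids == ent
        if entity_labels[ent] == 1:
            p_ent = 1.0 - np.prod(1.0 - p_mention[idx])        # Eq. 6
            q[idx] = p_mention[idx] / max(p_ent, 1e-12)        # Eq. 10
    return q

def m_step(X, q, C=0.1):
    """Maximize Eq. 11 by listing each mention twice: once with label 1 and
    weight q(z_i = 1), once with label 0 and weight 1 - q(z_i = 1)."""
    n = len(q)
    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(n), np.zeros(n)])
    w2 = np.concatenate([q, 1.0 - q])
    return LogisticRegression(C=C, solver="lbfgs").fit(X2, y2, sample_weight=w2)
```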
Thursday, April 27, 17

Results

Model                     AUPRC   F1
hard-LR, dep. feats.      0.117   0.229
hard-LR, n-gram feats.    0.134   0.257
hard-LR, all feats.       0.142   0.266
hard-CNN                  0.130   0.252
soft-CNN (EM)             0.164   0.267
soft-LR (EM)              0.193   0.316
Data upper bound (§4.6)   0.57    0.73

Table 5: Area under the precision-recall curve (AUPRC) and F1 (its maximum value from the PR curve) for entity prediction on the test set.

Figure 4: Precision-recall curves for the given models. [figure]

          Rule   Prec.   Recall   F1
SEMAFOR   R1     0.011   0.436    0.022
SEMAFOR   R2     0.031   0.162    0.051
SEMAFOR   R3     0.098   0.009    0.016
RPI-JIE   R1     0.016   0.447    0.030
RPI-JIE   R2     0.044   0.327    0.078
RPI-JIE   R3     0.172   0.168    0.170
Data upper bound (§4.6)  1.0   0.57   0.73

Table 6: Precision, recall, and F1 scores for test data using the event extractors SEMAFOR and RPI-JIE and rules R1–R3 described below.
29

5 Off-the-shelf event extraction baselines
From a practitioner's perspective, a natural first approach to this task would be to run the corpus of police fatality documents through pre-trained, "off-the-shelf" event extractor systems that could identify killing events. In modern NLP research, a major paradigm for event extraction is to formulate a hand-crafted ontology of event classes, annotate a small corpus, and craft supervised learning systems to predict event parses of documents. We evaluate two freely available, off-the-shelf event extractors that were developed under this paradigm: SEMAFOR (Das et al., 2014) and the RPI Joint Information Extraction System (RPI-JIE) (Li and Ji, 2014), which output semantic structures following the FrameNet (Fillmore et al., 2003) and ACE (Doddington et al., 2004) event ontologies, respectively.[14] Pavlick et al. (2016) use RPI-JIE to identify instances of gun violence.
For each mention i ∈ M we use SEMAFOR and RPI-JIE to extract event tuples of the form t_i = (event type, agent, patient) from the sentence x_i. We want the system to detect (1) killing events, where (2) the killed person is the target mention i, and (3) the person who killed them is a police officer. We implement a small progression of these neo-Davidsonian (Parsons, 1990) conjuncts with rules[15] to classify z_i = 1 if:
• (R1) the event type is 'kill.'
• (R2) R1 holds and the patient token span contains e_i.
• (R3) R2 holds and the agent token span contains a police keyword.
As in §4.1, we aggregate mention-level z_i predictions to obtain entity-level predictions with a deterministic OR of z_M(e).
RPI-JIE under the full R3 system performs best, though all results are relatively poor (Table 6). Part of this is due to the inherent difficulty of the task, though our task-specific model still outperforms these baselines (Table 5). We suspect a major issue is that these systems rely heavily on their annotated training sets and may lose significant performance on new domains or on messy text extracted from web news, suggesting domain transfer as future work.
[14] Many other annotated datasets encode similar event structures in text, but with lighter ontologies where event classes directly correspond with lexical items, including PropBank, Prague Treebank, DELPH-IN MRS, and Abstract Meaning Representation (Kingsbury and Palmer, 2002; Hajic et al., 2012; Oepen et al., 2014; Banarescu et al., 2013). We assume such systems are too narrow for our purposes, since we need an extraction system to handle different trigger constructions like "killed" versus "shot dead."
[15] For SEMAFOR, we use the FrameNet 'Killing' frame with frame elements 'Victim' and 'Killer'. For RPI-JIE, we use the ACE 'life/die' event type/subtype with roles 'victim' and 'agent'. SEMAFOR defines a token span for every argument; RPI-JIE/ACE defines two spans, both a head word and an entity extent; we use the entity extent. SEMAFOR only predicts spans as event arguments, while RPI-JIE also predicts entities as event arguments and gives each a within-text coreference chain; since we only use single sentences, these tend to be small, but it does sometimes resolve pronouns. For determining R2 and R3, we allow a match on any of an entity's extents.
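A sketch of how rules R1–R3 might be applied to the extracted event tuples; the tuple fields, the 'kill' event-type string, and the police keyword list are illustrative placeholders rather than the exact SEMAFOR/RPI-JIE output format.

```python
POLICE_KEYWORDS = {"police", "officer", "cop", "deputy", "sheriff", "trooper"}  # illustrative

def rule_label(event_tuples, target_name, rule="R3"):
    """z_i = 1 if any extracted (event_type, agent_span, patient_span) tuple
    from the sentence satisfies the chosen rule."""
    for event_type, agent_span, patient_span in event_tuples:
        if event_type != "kill":                                       # R1
            continue
        if rule == "R1":
            return 1
        if target_name not in patient_span:                            # R2
            continue
        if rule == "R2":
            return 1
        if any(k in agent_span.lower() for k in POLICE_KEYWORDS):      # R3
            return 1
    return 0
```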
Thursday, April 27, 17

EM Training
Logistic regression

Figure 1: For soft-LR (EM), area under the precision-recall curve (AUPRC) results on the test set for different inverse regularization values (C, the parameters' prior variance). [figure]

D1  length-3 dependency paths that include TARGET: word, POS, dep. label
D2  length-3 dependency paths that include TARGET: word and dep. label
D3  length-3 dependency paths that include TARGET: word and POS
D4  all length-2 dependency paths with word, POS, dep. labels
N1  n-grams of length 1, 2, 3
N2  n-grams of length 1, 2, 3 plus POS tags
N3  n-grams of length 1, 2, 3 plus directionality and position from TARGET
N4  concatenated POS tags of a 5-word window centered on TARGET
N5  word and POS tags for a 5-word window centered on TARGET

Table 4: Feature templates for logistic regression, grouped into syntactic dependency (D) and n-gram (N) features.
30

4.4 Feature-based logistic regression
We construct hand-crafted features for regularized logistic regression (LR) (Table 4), designed to be broadly similar to the n-gram and syntactic dependency features used in previous work on feature-based semantic parsing (e.g. Das et al. (2014); Thomson et al. (2014)). We use randomized feature hashing (Weinberger et al., 2009) to efficiently represent features in 450,000 dimensions, which achieved similar performance as an explicit feature representation. The model is trained with scikit-learn (Pedregosa et al., 2011).[10] For EM (soft-LR) training, the test set's area under the precision-recall curve converges after 96 iterations (Fig. 1).
[10] With FeatureHasher, L2 regularization, the 'lbfgs' solver, and inverse strength C = 0.1, tuned on a development dataset in "hard" training; for EM training the same regularization strength performs best.
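A sketch of a hashed-feature logistic regression pipeline in the spirit of Table 4 and §4.4: only simplified n-gram templates around the TARGET token are shown, the FeatureHasher dimensionality and LogisticRegression settings follow footnote 10, and everything else (function names, tokenized inputs) is assumed for illustration.

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

def ngram_features(tokens, target_index, max_n=3, window=5):
    """Simplified n-gram templates near the TARGET token (cf. N1, N3, N5)."""
    feats = {}
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for n in range(1, max_n + 1):
        for start in range(lo, hi - n + 1):
            gram = "_".join(tokens[start:start + n])
            feats["ngram=" + gram] = 1.0
            feats["ngram=%s@pos=%d" % (gram, start - target_index)] = 1.0
    return feats

hasher = FeatureHasher(n_features=450_000, input_type="dict")

def fit_mention_classifier(tokenized_sentences, target_indices, labels, C=0.1):
    X = hasher.transform(ngram_features(toks, ti)
                         for toks, ti in zip(tokenized_sentences, target_indices))
    return LogisticRegression(C=C, solver="lbfgs").fit(X, labels)
```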

4.5 Convolutional neural network
We also train a convolutional neural network (CNN) classifier, which uses word embeddings and their nonlinear compositions to potentially generalize better than sparse lexical and n-gram features. CNNs have been shown useful for sentence-level classification tasks (Kim, 2014; Zhang and Wallace, 2015), relation classification (Zeng et al., 2014) and, as in our case, event detection (Nguyen and Grishman, 2015). We use Kim (2014)'s open-source CNN implementation,[11] where a logistic function makes the final mention prediction based on max-pooled values from convolutional layers of three different filter sizes. We use pretrained word embeddings for initialization,[12] and update them during training. We also add two special vectors for the TARGET and PERSON symbols, initialized randomly.[13]
[11] https://github.com/yoonkim/CNN_sentence
[12] From the same word2vec embeddings used in §3.
[13] Training proceeds with Adagrad (Duchi et al., 2011). We tested several different settings of dropout and L2 regularization hyperparameters on a development set, but found mixed results, so used their default values.
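A minimal PyTorch sketch of a Kim (2014)-style CNN mention classifier with three filter widths, max-over-time pooling, dropout, and a logistic output, as described above; this is an illustrative re-implementation with assumed hyperparameters, not the open-source implementation used in the paper.

```python
import torch
import torch.nn as nn

class MentionCNN(nn.Module):
    """Kim (2014)-style sentence classifier: embedding lookup, parallel
    convolutions with several filter widths, max-over-time pooling, dropout,
    and a logistic output giving P(z_i = 1 | x_i)."""

    def __init__(self, vocab_size, emb_dim=300, num_filters=100, widths=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # initialized from word2vec in practice
        self.convs = nn.ModuleList([nn.Conv1d(emb_dim, num_filters, w) for w in widths])
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(num_filters * len(widths), 1)

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))
        return torch.sigmoid(self.out(h)).squeeze(1)
```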

For training, we perform stochastic gradient descent for the negative expected log-likelihood (Eq. 11) by sampling with replacement fifty mention-label pairs for each minibatch, choosing each (i, k) ∈ M × {0, 1} with probability proportional to q(z_i = k). This strategy attains the same expected gradient as the overall objective. We use "epoch" to refer to training on 265,700 examples (approximately twice the number of mentions). Unlike EM for logistic regression, we do not run gradient descent to convergence, instead applying an E-step every two epochs to update q; this approach is related to incremental and online variants of EM (Neal and Hinton, 1998; Liang and Klein, 2009), and is justified since both the SGD steps and the E-steps improve the evidence lower bound (ELBO). We are not aware of recent work that uses EM to train latent-variable neural network models, though previous work has explored this combination (e.g. Jordan and Jacobs (1994)).
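A sketch of the minibatch construction just described: sample (mention, label) pairs with replacement, with probability proportional to q(z_i = k). The function name, seed handling, and batch-size default are illustrative.

```python
import numpy as np

def sample_minibatch(q, batch_size=50, seed=0):
    """Sample (mention index, label) pairs with replacement, with probability
    proportional to q(z_i = k); the expected gradient of the resulting
    minibatch loss matches the full objective in Eq. 11."""
    rng = np.random.default_rng(seed)
    n = len(q)
    weights = np.concatenate([1.0 - q, q])     # first n slots: k = 0; last n: k = 1
    weights = weights / weights.sum()
    draws = rng.choice(2 * n, size=batch_size, replace=True, p=weights)
    return [(int(d % n), int(d // n)) for d in draws]
```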

Thursday, April 27, 17

EM Training
Neural network
Figure 3: Test set AUPRC for three runs of soft-CNN (EM) (blue, higher in graph) and hard-CNN (red, lower in graph). Darker lines show performance of averaged predictions. [figure]
31

4.6 Evaluation
On documents from the test period (Sept–Dec 2016), our models predict entity-level labels P(y_e = 1 | x_M(e)) (Eq. 6), and we wish to evaluate whether retrieved entities are listed in Fatal Encounters as being killed during Sept–Dec 2016. We rank entities by predicted probabilities to construct a precision-recall curve (Fig. 4, Table 5). Area under the precision-recall curve (AUPRC) is calculated with a trapezoidal rule; F1 scores are shown for convenient comparison to non-ranking approaches (§5).
Excluding historical fatalities: Our model gives strong positive predictions for many people who were killed by police before the test period (i.e. before Sept 2016), when news articles contain discussion of historical police killings. We exclude these entities from evaluation, since we want to simulate an update to a fatality database (Fig. 2). Our test dataset contains 1,148 such historical entities.
[Figure 2 (diagram of the train/test split): At test time, there are matches between the knowledge base and the news reports both for persons killed during the test period ("positive", e.g. e2, killed Nov. 22, 2016, reported Dec. 1, 2016) and persons killed before it ("historical", e.g. e1, killed June 6, 2014, reported Oct. 3, 2016). Historical cases are excluded from evaluation.]
Data upper bound: Of the 452 gold entities in the FE database at test time, our news corpus only contained 258 (Table 2), hence the data upper bound of 0.57 recall, which also gives an upper bound of 0.57 on AUPRC. This is mostly a limitation of our news corpus: though we collect hundreds of thousands of news articles, it turns out Google News only accesses a subset of relevant web news, as opposed to the more comprehensive data sources manually reviewed by Fatal Encounters' human experts. We still believe our dataset is large enough to be realistic for developing better methods, and expect the same approaches could be applied to a more comprehensive news corpus.
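A sketch of the entity-level evaluation: rank entities by their noisyor probability, exclude historical cases, and compute the area under the precision-recall curve. The use of scikit-learn's precision_recall_curve and auc (which applies the trapezoidal rule) is an illustrative choice, and the input containers are assumptions.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def evaluate_entities(entity_probs, gold_test_entities, historical_entities):
    """entity_probs: dict name -> P(y_e = 1); gold_test_entities: names killed
    during the test period; historical_entities: names killed earlier, which
    are excluded from evaluation."""
    names = [n for n in entity_probs if n not in historical_entities]
    scores = np.array([entity_probs[n] for n in names])
    truth = np.array([int(n in gold_test_entities) for n in names])
    precision, recall, _ = precision_recall_curve(truth, scores)
    return auc(recall, precision)          # trapezoidal-rule AUPRC
```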

Thursday, April 27, 17

Predictions

entity (e)         prob.   selected mention text (x_i)
Keith Scott        0.98    "Charlotte protests Charlotte's Mayor Jennifer Roberts speaks to reporters the morning after protests against the police shooting of Keith Scott, in Charlotte, North Carolina."   (true positive)
Terence Crutcher   0.96    "Tulsa Police Department released video footage Monday, Sept. 19, 2016, showing white Tulsa police officer Betty Shelby fatally shooting Terence Crutcher, 40, a black man police later determined was unarmed."   (true positive)
Mark Duggan        0.97    "The fatal shooting by police led to some of the worst riots in England's recent history."   (false positive)
Logan Clarke       0.92    "Logan Clarke was shot by a campus police officer after waving kitchen knives at fellow students outside the cafeteria at Hug High School in Reno, Nevada, on December 7."   (false positive)

Table 7: Examples of highly ranked entities, with selected mention predictions and text.
32

6 Results and discussion
Usefulness for practitioners: Our results indicate our model is better than existing methods at extracting the names of people killed by police. Comparing F1 scores, our distant supervision models outperform the off-the-shelf event extractors (Table 5 vs. Table 6). We also compared entities extracted from our test dataset to the Guardian's "The Counted" database of U.S. police killings during the span of the test period (Sept.–Dec. 2016),[16] and found 39 persons they did not include in the database, but who were in fact killed by police. This implies our approach could in fact augment journalistic collection efforts. Additionally, our model will help practitioners by presenting them with sentence-level information in the form of Table 7; we expect this would significantly decrease the manual hours and the emotional toll required to maintain real-time updates of police fatality databases.
CNN: Model predictions were relatively unstable during the training process. Although EM's evidence lower bound objective (H(Q) + E_Q[log P(Z, Y | X)]) converged fairly well on the training set, test set AUPRC fluctuated by as much as 2% between epochs, and also between three different random initializations for training (Fig. 3). We conducted these multiple runs initially to check for variability, then used them to construct a basic ensemble: we averaged the three models' mention-level predictions before applying noisyor aggregation. This outperformed the individual models (especially for EM training) and showed less fluctuation in AUPRC, which made it easier to detect convergence. Reported performance numbers in Table 5 use the average of all three runs from the final epoch of training.
LR vs. CNN: After feature ablation we found that hard-CNN and hard-LR with n-gram features (N1–N5) had comparable AUPRC values (Table 5). But adding dependency features (D1–D4) caused the logistic regression models to outperform the neural networks. We hypothesize these dependency features capture longer-distance semantic relationships between the entity, the fatality trigger word, and the police officer, which short n-grams cannot. Moving to sequence or graph LSTMs may better capture such dependencies.
Manual analysis: Manual analysis of false positives indicates that common errors include misspellings or mismatches of names, police fatalities outside of the U.S., people who were shot by police but not killed, and names of police officers who were killed (see the detailed table in the appendix). This analysis suggests many prediction errors are from ambiguous or challenging cases.[17]
Future work: While we have made progress on this application, more work is necessary for accuracy to be high enough to be useful for practitioners. Our model allows for the use of mention-level semantic parsing models; systems with explicit trigger/agent/patient representations, more like traditional event extraction systems, may be useful, as would more sophisticated neural network models, or attention models as an alternative to disjunction aggregation (Lin et al., 2016).
Furthermore, our dataset could be used to explore the dynamics of media attention (for example, the effect of race and geography on coverage of police killings) and discussions of historical killings in news; for example, many articles in 2016 discussed Michael Brown's 2014 death in Ferguson, Missouri. Improving NLP analysis of historical events would also be useful for the event extraction task itself. It may also be possible to adapt our model to extract other types of social behavior events.
[16] https://www.theguardian.com/us-news/series/counted-us-police-killings, downloaded Jan. 1, 2017.
[17] We attempted to correct non-U.S. false positive errors by using CLAVIN, an open-source country identifier, but this significantly hurt recall.
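A sketch of the basic ensemble described above: average the three runs' mention-level probabilities, then apply noisyor per entity. The data layout (parallel arrays of mention probabilities and entity ids) is an assumption for illustration.

```python
import numpy as np

def ensemble_entity_scores(runs_mention_probs, mention_entity_ids):
    """Average P(z_i = 1 | x_i) across runs for each mention, then noisyor.

    runs_mention_probs: list of arrays, one per training run, aligned over
    the same mentions; mention_entity_ids: entity id per mention."""
    avg = np.mean(np.stack(runs_mention_probs), axis=0)
    scores = {}
    for ent in set(mention_entity_ids):
        idx = [j for j, e in enumerate(mention_entity_ids) if e == ent]
        scores[ent] = 1.0 - np.prod(1.0 - avg[idx])
    return scores
```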
Predictions: top-ranked

rank | name                 | positive | analysis
   1 | Keith Scott          | true     |
   2 | Terence Crutcher     | true     |
   3 | Alfred Olango        | true     |
   4 | Deborah Danner       | true     |
   5 | Carnell Snell        | true     |
   6 | Kajuan Raye          | true     |
   7 | Terrence Sterling    | true     |
   8 | Francisco Serna      | true     |
   9 | Sam DuBose           | false    | name mismatch
  10 | Michael Vance        | true     |
  11 | Tyre King            | true     |
  12 | Joshua Beal          | true     |
  13 | (name not recovered) | false    | killed, not by police
  14 | Mark Duggan          | false    | non-US
  15 | Kirk Figueroa        | true     |
  16 | Anis Amri            | false    | non-US
  17 | Logan Clarke         | false    | shot, not killed
  18 | Craig McDougall      | false    | non-US
  19 | Frank Clark          | true     |
  20 | Benjamin Marconi     | false    | name of officer

Table 8: Top 20 entity predictions given by soft-LR (excluding historical entities), evaluated as "true" or "false" based on matching the gold knowledge base. False positives were manually analyzed. See Table 7 in the main paper for more detailed information regarding bold-faced entities.

… duplicates within a given month of data; this could be improved or modified in future work. This process removes 103,004 near-duplicate articles.

C Mention-level preprocessing

From the corpus of scraped news documents, to create the mention-level dataset (i.e., the set of sentences containing candidate entities) we:

1. Apply the Lynx text-based web browser to extract all of a webpage's text.
2. Segment sentences in two steps:
   (a) Segment documents into fragments of text (typically, paragraphs) by splitting on Lynx's representation of HTML paragraph markers, list markers, and other dividers: double newlines and the characters -, *, |, + and #.
   (b) Apply spaCy's sentence segmenter (and NLP pipeline) to these paragraph-like text fragments.
3. De-duplicate sentences as described in detail below.
4. Remove sentences that have fewer than 5 tokens or more than 200.
5. Remove entities (and associated mentions) that
   (a) contain punctuation (except for periods, hyphens, and apostrophes),
   (b) contain numbers, or
   (c) are one token in length.
6. Strip any "'s" occurring at the end of named entity spans.
7. Strip titles (e.g., Ms., Mr., Sgt., Lt.) occurring in entity spans. (HAPNIS sometimes identifies these types of titles; this step basically augments its rules.)
8. Filter to mentions that contain at least one police keyword and at least one fatality keyword.

We define sentence-level duplicates as cases where (1) the exact sentence is included more than once among the mentions for a given entity, and (2) nearly the exact same sentence (token-level Jaccard ≥ 0.9) is included more than once for a given entity in a given month. We remove these duplicate sentences from our mention-level dataset, eliminating all but one sentence, selecting the earliest by download time of its scraped webpage.
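The near-duplicate rule in step 3 (and the definition just above) can be sketched as follows. This is an illustrative Python snippet under assumed data structures (a list of mention dicts already restricted to one entity and one month), not the paper's exact pipeline.

```python
def token_jaccard(tokens_a, tokens_b):
    """Token-level Jaccard similarity between two sentences."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedupe_sentences(mentions, threshold=0.9):
    """Keep one sentence per (near-)duplicate group, preferring the earliest download time.

    mentions: list of dicts with 'tokens' and 'download_time' keys, assumed to
              already be restricted to a single entity and a single month.
    """
    kept = []
    for m in sorted(mentions, key=lambda m: m["download_time"]):
        # Drop the sentence if it (nearly) matches one we already kept.
        if any(token_jaccard(m["tokens"], k["tokens"]) >= threshold for k in kept):
            continue
        kept.append(m)
    return kept
```

Exact duplicates fall out of the same check, since identical token sets have Jaccard similarity 1.0.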
D Noisy-or numerical stability

Under "hard" training, many entities at test time have probabilities very close to 1; in some cases, higher than 1 − e^(−1000). This happens for entities with a very large number of mentions, where the naive implementation of noisy-or as p = 1 − ∏_i (1 − p_i) suffers numerical underflow and thus many entities tie at exactly 1, such that randomly breaking the ties can affect AUPRC. (Note that floating point numbers have worse tolerance near 1 than near 0.) Instead, we rank entity predictions by the log of the complement probability (i.e., −1000 for p = 1 − e^(−1000)):

log(1 − P(y_e = 1 | x_M(e))) = Σ_i log P(z_i = 0 | x_i),

where the sum ranges over the mentions i in M(e).

E Manual analysis of results

Manual analysis is available in Table 8.
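As a minimal sketch of the log-complement ranking from Appendix D above (illustrative only; the clipping guard and data layout are assumptions, not the authors' code):

```python
import numpy as np

def log_complement_score(mention_probs, eps=1e-12):
    """Score an entity by log(1 - P(y_e = 1 | x_M(e))) = sum_i log P(z_i = 0 | x_i).

    The naive 1 - prod_i(1 - p_i) rounds to exactly 1.0 once an entity has many
    confident mentions; summing log(1 - p_i) instead preserves a usable ranking.
    More negative scores mean the entity is more likely a police fatality victim.
    """
    # Guard against mention probabilities equal to 1.0 (assumed clipping, would give -inf).
    p = np.clip(np.asarray(mention_probs, dtype=float), 0.0, 1.0 - eps)
    return float(np.sum(np.log1p(-p)))  # log1p(-p) == log(1 - p)

# Hypothetical usage: sort ascending, so the most likely entities come first.
entity_mention_probs = {"entity_A": [0.999] * 500, "entity_B": [0.6, 0.4]}
ranked = sorted(entity_mention_probs,
                key=lambda e: log_complement_score(entity_mention_probs[e]))
print(ranked)  # ['entity_A', 'entity_B']
```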

Conclusions

• Natural language processing can help acquire more behavioral data from news
  • International relations [Schrodt and Gerner, 1994; Schrodt, 2012; Boschee et al., 2013; O’Connor et al., 2013; Gerrish, 2013]
  • Protests [Hanna 2017]
  • Gun violence [Pavlick et al. 2016]
• Define a new corpus-level event extraction task
  • Planning to release data. Any interest in shared task?
• Task/domain-specific approach for the “database update” problems
  • How to generalize across many questions / event types?
• Assumes media production reflects reality. Alternative: analyze e.g. media bias/attention
  • Typical approach in political science or literature content analysis
• NLP and social analysis
  • Concrete, real-world tasks useful testbed for NLP research
  • NLP could offer something useful for important tasks!

34

Thursday, April 27, 17