Learning to Extract International Relations from Political Context
Learning to Extract International Relations from Political Context

Brendan O'Connor (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA)
Brandon M. Stewart (Department of Government, Harvard University, Cambridge, MA 02139, USA)
Noah A. Smith (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA)

Abstract

We describe a new probabilistic model for extracting events between major political actors from news corpora. Our unsupervised model brings together familiar components in natural language processing (like parsers and topic models) with contextual political information—temporal and dyad dependence—to infer latent event classes. We quantitatively evaluate the model's performance on political science benchmarks: recovering expert-assigned event class valences, and detecting real-world conflict. We also conduct a small case study based on our model's inferences.

A supplementary appendix, and replication software/data, are available online at: http://brenocon.com/irevents

1 Introduction

The digitization of large news corpora has provided an unparalleled opportunity for the systematic study of international relations. Since the mid-1960s political scientists have used political events data, records of public micro-level interactions between major political actors of the form "someone does something to someone else" as reported in the open press (Schrodt, 2012), to study the patterns of interactions between political actors and how they evolve over time. Scaling this data effort to modern corpora presents an information extraction challenge: can a structured collection of accurate, politically relevant events between major political actors be extracted automatically and efficiently? And can they be grouped into meaningful event types with a low-dimensional structure useful for further analysis?

We present an unsupervised approach to event extraction, in which political structure and linguistic evidence are combined. A political context model of the relationship between a pair of political actors imposes a prior distribution over types of linguistic events. Our probabilistic model infers latent frames, each a distribution over textual expressions of a kind of event, as well as a representation of the relationship between each political actor pair at each point in time. We use syntactic preprocessing and a logistic normal topic model, including latent temporal smoothing on the political context prior.

We apply the model in a series of comparisons to benchmark datasets in political science. First, we compare the automatically learned verb classes to a pre-existing ontology and hand-crafted verb patterns from TABARI,[1] an open-source and widely used rule-based event extraction system for this domain. Second, we demonstrate correlation to a database of real-world international conflict events, the Militarized Interstate Dispute (MID) dataset (Jones et al., 1996). Third, we qualitatively examine a prominent case not included in the MID dataset, Israeli-Palestinian relations, and compare the recovered trends to the historical record.

We outline the data used for event discovery (§2), describe our model (§3), inference (§4), evaluation (§5), and comment on related work (§6).

2 Data

The model we describe in §3 is learned from a corpus of 6.5 million newswire articles from the English Gigaword 4th edition (1994–2008, Parker et al., 2009). We also supplement it with a sample of data from the New York Times Annotated Corpus (1987–2007, Sandhaus, 2008).[2] The Stanford CoreNLP system,[3] under default settings, was used to POS-tag and parse the articles, to eventually produce event tuples of the form

    ⟨s, r, t, w_predpath⟩

where s and r denote "source" and "receiver" arguments, which are political actor entities in a predefined set E, t is a timestep (i.e., a 7-day period) derived from the article's published date, and w_predpath is a textual predicate expressed as a dependency path that typically includes a verb (we use the terms "predicate-path" and "verb-path" interchangeably). For example, on January 1, 2000, the AP reported "Pakistan promptly accused India," from which our preprocessing extracts the tuple ⟨PAK, IND, 678, accuse ←dobj⟩. (The path excludes the first source-side arc.) Entities and verb paths are identified through the following sets of rules.

Named entity recognition and resolution is done deterministically by finding instances of country names from the CountryInfo.txt dictionary from TABARI,[4] which contains proper noun and adjectival forms for countries and administrative units. We supplement these with a few entries for international organizations from another dictionary provided by the same project, and clean up a few ambiguous names, resulting in a final actor dictionary of 235 entities and 2,500 names.

Whenever a name is found, we identify its entity's mention as the minimal noun phrase that contains it; if the name is an adjectival or noun-noun compound modifier, we traverse any such amod and nn dependencies to the noun phrase head. Thus NATO bombing, British view, and Palestinian militant resolve to the entity codes IGONAT, GBR, and PSE respectively.

We are interested in identifying actions initiated by agents of one country targeted towards another, and hence concentrate on verbs, analyzing the "CCprocessed" version of the Stanford Dependencies (de Marneffe and Manning, 2008). Verb paths are identified by looking at the shortest dependency path between two mentions in a sentence. If one of the mentions is immediately dominated by a nsubj or agent relation, we consider that the Source actor, and the other mention is the Receiver. The most common cases are simple direct objects and prepositional arguments like talk ←prep_with and fight ←prep_alongside ("talk with R," "fight alongside R"), but many interesting multiword constructions also result, such as reject ←dobj allegation ←poss ("rejected R's allegation") or verb chains as in offer ←xcomp help ←dobj ("offer to help R").

We wish to focus on instances of directly reported events, so we attempt to remove factively complicated cases such as indirect reporting and hypotheticals by discarding all predicate paths for which any verb on the path has an off-path governing verb with a non-conj relation. (For example, the verb at the root of a sentence always survives this filter.) Without this filter, the ⟨s, r, w⟩ tuple ⟨USA, CUB, want ←xcomp seize ←dobj⟩ is extracted from the sentence "Parliament Speaker Ricardo Alarcon said the United States wants to seize Cuba and take over its lands"; the filter removes it since wants is dominated by an off-path verb through say ←ccomp wants. The filter was iteratively developed by inspecting dozens of output examples and their labelings under successive changes to the rules.

Finally, only paths of length 4 or less are allowed, the final dependency relation for the receiver may not be nsubj or agent, and the path may not contain any of the dependency relations conj, parataxis, det, or dep. We use lemmatized word forms in defining the paths.

Several document filters are applied before tuple extraction. Deduplication removes 8.5% of articles.[5] For topic filtering, we apply a series of keyword filters to remove sports and finance news, and also apply a text classifier for diplomatic and military news, trained on several hundred manually labeled news articles (using ℓ1-regularized logistic regression with unigram and bigram features). Other filters remove non-textual junk and non-standard punctuation likely to cause parse errors.

For experiments we remove tuples where the source and receiver entities are the same, and restrict to tuples with dyads that occur at least 500 times, and predicate paths that occur at least 10 times. This yields 365,623 event tuples from 235,830 documents, for 421 dyads and 10,457 unique predicate paths. We define timesteps to be 7-day periods, resulting in 1,149 discrete timesteps.

[1] Available from the Penn State Event Data Project: http://eventdata.psu.edu/
[2] For arbitrary reasons this portion of the data is much smaller (we only parse the first five sentences of each article, while Gigaword has all sentences parsed), resulting in less than 2% as many tuples as from the Gigaword data.
[3] http://nlp.stanford.edu/software/corenlp.shtml
[4] http://eventdata.psu.edu/software.dir/dictionaries.html
[5] We use a simple form of shingling (ch. 3, Rajaraman and Ullman, 2011): represent a document signature as its J = 5 lowercased bigrams with the lowest hash values, and reject a document with a signature that has been seen before within the same month. J was manually tuned, as it affects the precision/recall tradeoff.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1094–1104, Sofia, Bulgaria, August 4–9 2013. ©2013 Association for Computational Linguistics

[Figure: plate diagram of the model. A context model with parameters τ², σ²_k, α_k and variables η_{k,s,r,t}, β_{k,s,r,t} determines P(Event Type | Context), i.e., the frame distribution θ_{s,r,t} for each source entity s, receiver entity r, and timestep t, from which each tuple's frame z is drawn.]

• For each frame k, draw a multinomial distribution of dependency paths, φ_k ∼ Dir(b/V) (where V is the number of dependency path types).
• For each (s, r, t), for every event tuple i in that context:
  – Sample its frame z^(i) ∼ Mult(θ_{s,r,t}).
  – Sample its predicate realization w^(i)_predpath ∼ Mult(φ_{z^(i)}).

Thus the language model is very similar to a topic model's generation of token topics and wordtypes.
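The tuple-generating steps of the language model can be sketched numerically. This is an illustrative simulation only: the number of frames K, vocabulary size V, concentration b, and the context's frame distribution theta below are toy values we chose, not the paper's settings.

```python
import numpy as np

def sample_tuples(theta, phi, n, rng):
    """For one (s, r, t) context, draw n event tuples:
    frame z^(i) ~ Mult(theta), then path w^(i) ~ Mult(phi[z^(i)])."""
    z = rng.choice(len(theta), size=n, p=theta)
    w = np.array([rng.choice(phi.shape[1], p=phi[k]) for k in z])
    return z, w

rng = np.random.default_rng(0)
K, V, b = 3, 10, 0.5                      # toy sizes: frames, path types, concentration
phi = rng.dirichlet([b / V] * V, size=K)  # phi_k ~ Dir(b/V), one row per frame
theta = np.array([0.7, 0.2, 0.1])         # one context's distribution over frames
z, w = sample_tuples(theta, phi, n=100, rng=rng)
```

With a small concentration b, each φ_k is sparse, so each frame concentrates its mass on a few predicate paths, just as a topic concentrates on a few word types.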
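The verb-path identification in §2 (shortest dependency path between two mentions, then the length and relation constraints) can be sketched as follows. The edge-list representation and function names here are our own illustrative choices, not the authors' code.

```python
from collections import deque

BANNED_RELATIONS = {"conj", "parataxis", "det", "dep"}

def shortest_path(edges, start, end):
    """BFS over the dependency edges, treated as an undirected graph.
    edges: (head_index, relation, dependent_index) triples.
    Returns the list of edges on the shortest path, or None."""
    adj = {}
    for h, rel, d in edges:
        adj.setdefault(h, []).append((d, (h, rel, d)))
        adj.setdefault(d, []).append((h, (h, rel, d)))
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for neighbor, edge in adj.get(node, []):
            if neighbor not in prev:
                prev[neighbor] = (node, edge)
                queue.append(neighbor)
    if end not in prev:
        return None
    path, node = [], end
    while prev[node] is not None:
        node, edge = prev[node]
        path.append(edge)
    return path[::-1]

def keep_path(path, receiver_relation):
    """Apply the paper's constraints: path length at most 4, the receiver's
    final relation may not be nsubj/agent, and no banned relations appear."""
    if path is None or len(path) > 4:
        return False
    if receiver_relation in {"nsubj", "agent"}:
        return False
    return all(rel not in BANNED_RELATIONS for _, rel, _ in path)
```

For "Pakistan accused India" with edges [(1, "nsubj", 0), (1, "dobj", 2)], the path from token 0 to token 2 passes through the verb; keep_path accepts a receiver attached as dobj but rejects one attached as nsubj.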
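The mention-resolution rule (traverse amod/nn modifiers to the noun-phrase head) can be illustrated with a toy dictionary. ACTOR_DICT below is a three-entry stand-in for the much larger CountryInfo.txt resource, and resolve_mention is our own sketch of the rule, not the authors' implementation.

```python
# Toy stand-in for the TABARI CountryInfo.txt actor dictionary
# (the real resource maps ~2,500 names to 235 entity codes).
ACTOR_DICT = {"british": "GBR", "palestinian": "PSE", "nato": "IGONAT"}

def resolve_mention(tokens, deps, name_index):
    """Return (entity_code, head_index) for a matched name.  If the name is
    an adjectival or noun-noun compound modifier (amod/nn), traverse to its
    governor so the mention covers the noun phrase head, per the rule above."""
    code = ACTOR_DICT[tokens[name_index].lower()]
    for head, rel, dep in deps:
        if dep == name_index and rel in {"amod", "nn"}:
            return code, head
    return code, name_index
```

Thus "Palestinian militant" with the amod edge (1, "amod", 0) resolves to PSE, with "militant" as the mention head.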
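The shingling-based deduplication described in footnote 5 can be sketched as below. The class structure and the use of Python's built-in hash are our own choices (hash is salted per process, but signatures are consistent within one run, which is all deduplication needs).

```python
import re
from collections import defaultdict

def signature(text, J=5):
    """Signature = the J lowercased bigrams with the lowest hash values."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = set(zip(tokens, tokens[1:]))
    return frozenset(sorted(bigrams, key=hash)[:J])

class Deduplicator:
    """Reject a document whose signature was already seen in the same month."""
    def __init__(self):
        self.seen_by_month = defaultdict(set)

    def is_duplicate(self, text, month):
        sig = signature(text)
        if sig in self.seen_by_month[month]:
            return True
        self.seen_by_month[month].add(sig)
        return False
```

Raising J makes signatures more specific, trading recall of near-duplicates for precision, which is the tradeoff the footnote says J was tuned for.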