Learning to Extract International Relations from News Text
Brendan O'Connor, Machine Learning Department, Carnegie Mellon University
Presentation: NSF SOCS Doctoral Symposium, June 27, 2013
Forthcoming, ACL 2013. Joint work with Brandon Stewart (political science, Harvard) and Noah Smith (CMU).
Paper and other information at: http://brenocon.com/irevents/

Computational Social Science
• The 1890 Census tabulator solved the 1880s data deluge.
• Computation as a tool for social science applications.

Automated Text Analysis
• Textual media: news, books, articles, internet, messages...
• Automated content analysis: tools for discovery and measurement of concepts, attitudes, and events.
• Natural language processing, information retrieval, data mining, and machine learning as quantitative social science methodology.

International Relations Event Data
• Event data extracted from news text: http://gdelt.utdallas.edu

Previous work: knowledge engineering approach
• Open-source TABARI software and ontology/patterns: ~15,000 verb patterns, ~200 event classes (Schrodt 1994-2012; the ontology goes back to the 1960s).
• Example event types:
  03 - EXPRESS INTENT TO COOPERATE
  07 - PROVIDE AID
  15 - EXHIBIT MILITARY POSTURE
  191 - Impose blockade, restrict movement
• Example verb patterns per event type:
  not_ allow to_ enter ;mj 02 aug 2006
  barred travel
  block traffic from ;ab 17 nov 2005
  block road ;hux 1/7/98
• Patterns are matched against news text to extract events.
• Issues:
  1. Hard to maintain and adapt to new domains.
  2. Precision is low (Boschee et al. 2013).

Our approach
• Joint learning for a high-level summary of event timelines:
  1. Automatically learn the verb ontology.
  2. Extract events / political dynamics.
• Social context drives unsupervised learning about language.

Newswire entity/predicate data
• 6.5 million news articles, 1987-2008.
• Focus on events between two actors: (SourceEntity, ReceiverEntity, Time, w_predpath).
• "Pakistan promptly accused India" [1/1/2000] => (PAK, IND, 268, SRC -nsubj> accuse <dobj- REC)
• Named entities: dictionary of country names.
• Predicate paths: the verb dominates the Source in subject position; the Receiver most commonly appears in direct object or prepositional object constructions (some others too).

Newswire entity/predicate data (continued)
• Very rare to see parsers in text-as-data studies: parsers are slow, hard to use, and make errors.
• Entities appear as noun phrases, events as verbs and their arguments; plain co-occurrence has low precision.
• Preprocess with Stanford CoreNLP for part-of-speech tags and syntactic dependencies (a toy extraction sketch follows this slide).
  [Figure: constituency parse of "The cat in the hat saw him with a telescope"]
• Filters for topics, factivity, verb-y paths, and parse quality.
• Makes unsupervised learning easier: verb-argument information decides which words represent the event.
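To make the tuple extraction concrete, here is a minimal sketch, assuming a toy (head, relation, dependent) edge format and a two-entry country dictionary; the real pipeline runs Stanford CoreNLP and applies the additional filters listed above, so this is only an illustration.

```python
# Minimal illustration of dyadic event-tuple extraction from a dependency
# parse. The edge format and COUNTRY_CODES dictionary are toy assumptions;
# the actual system uses Stanford CoreNLP output plus the filters above.

COUNTRY_CODES = {"Pakistan": "PAK", "India": "IND"}  # toy entity dictionary

# (head, relation, dependent) edges for "Pakistan promptly accused India"
edges = [
    ("accused", "nsubj", "Pakistan"),
    ("accused", "advmod", "promptly"),
    ("accused", "dobj", "India"),
]

def extract_event(edges):
    """Return (source, receiver, predpath) when a verb dominates one known
    country as its subject and another as a direct/prepositional object."""
    for verb, rel, dep in edges:
        if rel != "nsubj" or dep not in COUNTRY_CODES:
            continue
        for verb2, rel2, dep2 in edges:
            if verb2 == verb and rel2 in ("dobj", "pobj") and dep2 in COUNTRY_CODES:
                path = f"SRC -nsubj> {verb} <{rel2}- REC"  # real system lemmatizes: accuse
                return (COUNTRY_CODES[dep], COUNTRY_CODES[dep2], path)
    return None

print(extract_event(edges))  # ('PAK', 'IND', 'SRC -nsubj> accused <dobj- REC')
```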
Vanilla Model
• Independent contexts: frame learning from verb co-occurrence within contexts.
• One (s,r,t) slice of the graphical model.
  [Figure: one (s,r,t) slice of the directed graphical model, showing the frame prior $\eta_{k,s,r,t}$ (frame log-odds in this context), overall frame (event class) prevalence $\alpha_k$, the Context Model $P(\text{EventType} \mid \text{Context})$, and the Language Model $P(\text{Text} \mid \text{EventType})$, over entities s ("Source") and r ("Receiver"), timestep t, event tuples i, and frames k.]
• Training: blocked Gibbs sampling (Markov chain Monte Carlo).

From the paper, the generative story:

The context model generates a frame prior $\theta_{s,r,t}$ for every context $(s,r,t)$.

Language model:
• Draw lexical sparsity parameter $b$ from a diffuse prior (see §4).
• For each frame $k$, draw a multinomial distribution of dependency paths, $\phi_k \sim \mathrm{Dir}(b)$.
• For each $(s,r,t)$, for every event tuple $i$ in that context,
  • Sample its frame $z^{(i)} \sim \mathrm{Mult}(\theta_{s,r,t})$.
  • Sample its predicate realization $w^{(i)}_{\mathrm{predpath}} \sim \mathrm{Mult}(\phi_z)$.

Thus the language model is very similar to a topic model's generation of token topics and wordtypes. We use structured logistic normal distributions to represent contextual effects. The simplest is the vanilla (V) context model:
• For each frame $k$, draw global parameters from diffuse priors: prevalence $\alpha_k$ and variability $\sigma_k^2$.
• For each $(s,r,t)$,
  • Draw $\eta_{k,s,r,t} \sim N(\alpha_k, \sigma_k^2)$ for each frame $k$.
  • Apply a softmax transform,
    $$\theta_{k,s,r,t} = \frac{\exp \eta_{k,s,r,t}}{\sum_{k'=1}^{K} \exp \eta_{k',s,r,t}}$$

Thus the vector $\eta_{\cdot,s,r,t}$ encodes the relative log-odds of the different frames for events appearing in the context $(s,r,t)$. This simple logistic normal prior is, in terms of topic models, analogous to the asymmetric Dirichlet prior version of LDA in Wallach et al. (2009), since the $\alpha_k$ parameter can learn that some frames tend to be more likely than others. The variance parameter $\sigma_k^2$ controls admixture sparsity, analogous to a Dirichlet's concentration parameter.
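To ground the generative story, the following sketch samples one (s,r,t) context from the vanilla model. The frame count K, the three-path vocabulary, and the fixed hyperparameter values are assumptions chosen for the demo, not the paper's settings.

```python
# Toy forward simulation of the vanilla (V) context model for one (s,r,t)
# context. K, the path vocabulary, and hyperparameter values are demo
# assumptions, not the paper's settings.

import numpy as np

rng = np.random.default_rng(0)
K = 3                                       # number of latent frames
paths = ["SRC -nsubj> accuse <dobj- REC",
         "SRC -nsubj> meet <prep_with- REC",
         "SRC -nsubj> attack <dobj- REC"]

alpha = rng.normal(0.0, 1.0, K)             # frame prevalence (diffuse prior)
sigma2 = np.full(K, 1.0)                    # frame variability
b = 0.5                                     # lexical sparsity parameter
phi = rng.dirichlet(np.full(len(paths), b), K)  # per-frame path distributions

eta = rng.normal(alpha, np.sqrt(sigma2))    # log-odds for this context
theta = np.exp(eta) / np.exp(eta).sum()     # softmax -> frame prior

for i in range(5):                          # generate five event tuples
    z = rng.choice(K, p=theta)              # frame for event i
    w = rng.choice(len(paths), p=phi[z])    # predicate-path realization
    print(f"event {i}: frame {z}, path {paths[w]}")
```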
3.1 Smoothing Frames Across Time

The vanilla model is capable of inducing frames through dependency path co-occurrences when multiple events occur in a given context. However, many dyad-time slices are very sparse; for example, of the 739 weeks in the dataset, the most prevalent dyad (ISR-PSE) has a nonzero event count in 525 of them. Only 104 directed dyads have more than 25 nonzero weeks. One solution is to increase the bucket size (e.g., to months); however, previous work in political science has demonstrated that answering questions of interest about reciprocity dynamics requires recovering the events at weekly or even daily granularity (Shellman, 2004), and in any case wide buckets help only so much for dyads with fewer events or less media attention. Therefore we propose a smoothed frames (SF) model, in which the frame distribution for a given dyad comes from a latent parameter $\beta_{k,s,r,t}$ that smoothly varies over time. For each $(s,r)$, draw the first timestep's values as $\beta_{k,s,r,1} \sim N(0, 100)$, and for each context $(s,r,t>1)$,
• Draw $\beta_{k,s,r,t} \sim N(\beta_{k,s,r,t-1}, \tau^2)$
• Draw $\eta_{k,s,r,t} \sim N(\alpha_k + \beta_{k,s,r,t}, \sigma_k^2)$

Other parameters $(\alpha_k, \sigma_k^2)$ are the same as in the vanilla model. This model assumes a random walk process on $\beta$, a variable which exists even for contexts that contain no events. Thus inferences about $\eta$ will be smoothed according to event data at nearby timesteps. This is an instance of a linear Gaussian state-space model (also known as a linear dynamical system or dynamic linear model), and is a convenient formulation because it has well-known exact inference algorithms (a toy version is sketched below). This parameterization of $\eta$ is related to one of the topic models proposed in Blei and Lafferty (2006), though at a different structural level and using a different inference technique (§4). This also draws on models ...

[Figure 1: Directed probabilistic diagram of the model for one (s,r,t) dyad-time context, for the smoothed model.]

Smoothed Model
• Replaces the vanilla model's independent contexts with a linear dynamical system (random walk):
  $\beta_{k,s,r,1} \sim N(0, 100)$
  $\beta_{k,s,r,t} \sim N(\beta_{k,s,r,t-1}, \tau^2)$
  $\eta_{k,s,r,t} \sim N(\alpha_k + \beta_{k,s,r,t}, \sigma_k^2)$
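Because the smoothed-frames prior is a linear Gaussian state-space model, its "well-known exact inference" can be illustrated with a Kalman filter plus a Rauch-Tung-Striebel smoother. The sketch below substitutes a Gaussian observation model for the paper's logistic-normal/multinomial likelihood (which §4 handles inside blocked Gibbs sampling), so it is an analogy rather than the paper's algorithm.

```python
# Toy exact smoothing for a 1-D Gaussian random walk, the same linear
# dynamical system family as the SF prior. Gaussian observations are a
# simplification; the paper's likelihood is logistic-normal/multinomial.

import numpy as np

def rts_smooth(y, tau2, sigma2, m0=0.0, v0=100.0):
    """Posterior means of x_t given all y, for x_t ~ N(x_{t-1}, tau2),
    y_t ~ N(x_t, sigma2), x_1 ~ N(m0, v0) (matching beta_1 ~ N(0, 100))."""
    T = len(y)
    m = np.zeros(T); v = np.zeros(T)    # filtered means / variances
    mp = np.zeros(T); vp = np.zeros(T)  # one-step predictions
    for t in range(T):
        mp[t], vp[t] = (m0, v0) if t == 0 else (m[t-1], v[t-1] + tau2)
        k = vp[t] / (vp[t] + sigma2)    # Kalman gain
        m[t] = mp[t] + k * (y[t] - mp[t])
        v[t] = (1.0 - k) * vp[t]
    ms = m.copy()                       # RTS backward pass (means only)
    for t in range(T - 2, -1, -1):
        j = v[t] / vp[t + 1]            # smoother gain (identity transition)
        ms[t] = m[t] + j * (ms[t + 1] - m[t])
    return ms

y = np.array([0.0, 0.2, 3.0, 0.1, -0.1])    # sparse, noisy weekly signal
print(rts_smooth(y, tau2=0.5, sigma2=1.0))  # outlier pulled toward neighbors
```

Note how the outlying observation is pulled toward its neighbors: this is the smoothing behavior that lets the SF model share information across nearby weeks for sparse dyads.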