Exploiting Parallel News Streams for Unsupervised Event Extraction

Congle Zhang, Stephen Soderland & Daniel S. Weld
Computer Science & Engineering, University of Washington, Seattle, WA 98195, USA
{clzhang, soderlan, weld}@cs.washington.edu

Transactions of the Association for Computational Linguistics, vol. 3, pp. 117–129, 2015. Action Editor: Hal Daumé III. Submission batch: 10/2014; Revision batch: 1/2015; Published 2/2015. © 2015 Association for Computational Linguistics.

Abstract

Most approaches to relation extraction, the task of extracting ground facts from natural language text, are based on machine learning and thus starved by scarce training data. Manual annotation is too expensive to scale to a comprehensive set of relations. Distant supervision, which automatically creates training data, only works with relations that already populate a knowledge base (KB). Unfortunately, KBs such as Freebase rarely cover event relations (e.g., "person travels to location"). Thus, the problem of extracting a wide range of events, e.g., from news streams, is an important, open challenge. This paper introduces NEWSSPIKE-RE, a novel, unsupervised algorithm that discovers event relations and then learns to extract them. NEWSSPIKE-RE uses a novel probabilistic graphical model to cluster sentences describing similar events from parallel news streams. These clusters then comprise training data for the extractor. Our evaluation shows that NEWSSPIKE-RE generates high-quality training sentences and learns extractors that perform much better than rival approaches, more than doubling the area under a precision-recall curve compared to Universal Schemas.

1 Introduction

Relation extraction, the process of extracting structured information from natural language text, grows increasingly important for Web search and question answering. Traditional supervised approaches, which can achieve high precision and recall, are limited by the cost of labeling training data and are unlikely to scale to the thousands of relations on the Web.

Another approach, distant supervision (Craven and Kumlien, 1999; Wu and Weld, 2007), creates its own training data by matching the ground instances of a knowledge base (KB) (e.g., Freebase) to unlabeled text. Unfortunately, while distant supervision can work well in some situations, the method is limited to relatively static facts (e.g., born-in(person, location) or capital-of(location, location)) where there is a corresponding knowledge base. But what about dynamic event relations (also known as fluents), such as travel-to(person, location) or fire(organization, person)? Since these time-dependent facts are ephemeral, they are rarely stored in a pre-existing KB. At the same time, knowledge of real-time events is crucial for making informed decisions in fields like finance and politics. Indeed, news stories report events almost exclusively, so learning to extract events is an important open problem.

This paper develops a new unsupervised technique, NEWSSPIKE-RE, to both discover event relations and extract them with high precision. The intuition underlying NEWSSPIKE-RE is that the texts of articles from two different news sources are not independent, since they are each conditioned on the same real-world events. By looking for rarely described entities that suddenly "spike" in popularity on a given date, one can identify paraphrases. Such temporal correspondence (Zhang and Weld, 2013) allows one to cluster diverse sentences, and the resulting clusters may be used to form training data in order to learn event extractors.

Furthermore, one can also exploit parallel news to obtain direct negative evidence. To see this, suppose one day the news includes the following: (a) "Snowden travels to Hong Kong, off southeastern China." (b) "Snowden cannot stay in Hong Kong as Chinese officials will not allow ..." Since news stories are usually coherent, it is highly unlikely that travel to and stay in (which is negated) are synonymous. By leveraging such direct negative phrases, we can learn extractors capable of distinguishing heavily co-occurring but semantically different phrases, thereby avoiding many extraction errors.

Our NEWSSPIKE-RE system encapsulates these intuitions in a novel graphical model, making the following contributions:

• We develop a method to discover a set of distinct, salient event relations from news streams.
• We describe an algorithm to exploit parallel news streams to cluster sentences that belong to the same event relations. In particular, we propose the temporal negation heuristic to avoid conflating co-occurring but non-synonymous phrases.
• We introduce a probabilistic graphical model to generate training data for a sentential event extractor without requiring any human annotations.
• We present detailed experiments demonstrating that the event extractors, learned from the generated training data, significantly outperform several competitive baselines; e.g., our system more than doubles the area under the micro-averaged PR curve (0.80 vs. 0.30) compared to Riedel's Universal Schema (Riedel et al., 2013).

2 Previous Work

Supervised learning approaches have been widely developed for event extraction tasks such as MUC-4 and ACE. They often focus on a hand-crafted ontology and train the extractor with manually created training data. While they can offer high precision and recall, they are often domain-specific (e.g., biological events (Riedel et al., 2011; McClosky et al., 2011) and entertainment events (Benson et al., 2011; Reichart and Barzilay, 2012)), and are hard to scale to the events on the Web.

Open IE systems extract open-domain relations (e.g., (Banko et al., 2007; Fader et al., 2011)) and events (e.g., (Ritter et al., 2012)). They often perform self-supervised learning of relation-independent extractions. This allows them to scale, but makes them unable to output canonicalized relations.

Distantly supervised approaches have been developed to learn extractors by exploiting the facts existing in a knowledge base, thus avoiding human annotation. Wu et al. (2007) and Reschke et al. (2014) learned Infobox relations from Wikipedia, while Mintz et al. (2009) heuristically matched Freebase facts to texts. Since the training data generated by the heuristic matching is often imperfect, multi-instance learning approaches (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) have been developed to combat this problem. Unfortunately, most facts existing in KBs are static facts like geographical or biographical data. They fall short of learning extractors for fluent facts such as sports results or travel and meetings by a person.

Bootstrapping is another common extraction technique (Brin, 1999; Agichtein and Gravano, 2000; Carlson et al., 2010; Nakashole et al., 2011; Huang and Riloff, 2013). This typically takes a set of seeds as input, which can be ground instances or key phrases. The algorithms then iteratively generate more positive instances and phrases. While there are many successful examples of bootstrapping, the challenge is to avoid semantic drift. Large-scale systems, therefore, often require extra processing such as manual validation between the iterations or additional negative seeds as the input.

Unsupervised approaches have been developed for relation discovery and extraction. These algorithms are usually based on some clustering assumptions over a large unlabeled corpus. Common assumptions include the distributional hypothesis, used by (Hasegawa et al., 2004; Shinyama and Sekine, 2006), the latent topic assumption (Yao et al., 2012; Yao et al., 2011), and the low-rank assumption (Takamatsu et al., 2011; Riedel et al., 2013). Since the assumptions largely rely on co-occurrence, previous unsupervised approaches tend to confuse correlated but semantically different phrases during extraction. In contrast, our work largely avoids these errors by exploiting the temporal negation heuristic in parallel news streams. In addition, unlike many unsupervised algorithms requiring human effort to canonicalize the clusters, our work automatically discovers events with readable names.

Paraphrasing techniques inspire our work. Some techniques, such as DIRT (Lin and Pantel, 2001) and Resolver (Yates and Etzioni, 2009), are based on the distributional hypothesis. Another common approach is to use parallel corpora, including news streams (Barzilay and Lee, 2003; Dolan et al., 2004; Zhang and Weld, 2013), multiple translations of the same story (Barzilay and McKeown, 2001), and bilingual sentence pairs (Ganitkevitch et al., 2013), to generate the paraphrases. Although these algorithms create many good paraphrases, they cannot be directly used to generate enough training data to train a relation extractor, for two reasons: first, the semantics of the paraphrases is often context dependent; second, the generated paraphrases are often in

[Figure 1: During its training phase, NEWSSPIKE-RE first groups parallel sentences as NewsSpikes. (Diagram: the training phase groups parallel news streams into NewsSpikes NS = (a1, a2, d, S), discovers event relations E = e(t1, t2), generates training data, and learns a sentential event extractor; the testing phase applies the extractor to test sentences s, yielding extractions E(a1, a2).)]

... news sources typically use different sentences to describe the same event, and that corresponding sentences can be identified when they mention a unique pair of real-world entities. For example, when an unusual entity pair (Selena, Norway) is suddenly seen in three articles on a single day:

Selena traveled to Norway to see her ex-boyfriend.
Selena arrived in Norway for a rendezvous with Justin.
Selena's trip to Norway was no coincidence.

It is likely that all three refer to the same event relation.
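The grouping step illustrated by the Selena example can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the input format, the pre-extracted entity pairs, and the requirement that a NewsSpike draw on more than one source are simplifying assumptions introduced here for illustration.

```python
from collections import defaultdict

def group_news_spikes(articles):
    """Group parallel sentences into NewsSpikes (a1, a2, d, S):
    sentences from different articles that mention the same
    entity pair (a1, a2) on the same date d."""
    spikes = defaultdict(list)
    for date, source, sentence, entity_pairs in articles:
        for a1, a2 in entity_pairs:
            spikes[(a1, a2, date)].append((source, sentence))
    # Keep only entity pairs mentioned by more than one source that day;
    # such a sudden multi-source "spike" suggests parallel reporting.
    return {
        (a1, a2, d): [sent for _, sent in mentions]
        for (a1, a2, d), mentions in spikes.items()
        if len({src for src, _ in mentions}) > 1
    }

articles = [
    ("2013-06-10", "AP",  "Selena traveled to Norway to see her ex-boyfriend.",
     [("Selena", "Norway")]),
    ("2013-06-10", "CNN", "Selena arrived in Norway for a rendezvous with Justin.",
     [("Selena", "Norway")]),
    ("2013-06-10", "BBC", "Selena's trip to Norway was no coincidence.",
     [("Selena", "Norway")]),
]
spikes = group_news_spikes(articles)
```

Here all three sentences share the rare pair (Selena, Norway) on one date, so they form a single NewsSpike whose sentence set S becomes candidate paraphrase material.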
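The temporal negation heuristic from the introduction can likewise be sketched. Everything here is an illustrative assumption rather than the paper's method: the cue list and the token-window negation check are toys (a real system would consult a parse), and the function names are invented. The point is only the shape of the idea: within one day's coherent news, a phrase occurring under negation paired with a phrase occurring positively for the same entity pair yields direct negative evidence that the two are not synonymous.

```python
NEGATION_CUES = {"not", "cannot", "never", "no"}  # toy cue list

def is_negated(sentence, phrase):
    """Toy check: does a negation cue appear in the few tokens
    just before the phrase? (A real system would use a parse.)"""
    text = sentence.lower()
    idx = text.find(phrase)
    if idx < 0:
        return False
    window = text[:idx].split()[-3:]  # small window before the phrase
    return any(tok in NEGATION_CUES for tok in window)

def negative_phrase_pairs(spike_sentences, candidate_phrases):
    """If, within one NewsSpike, phrase p occurs positively and
    phrase q occurs under negation, emit (p, q) as negative
    evidence: p and q are unlikely to be synonymous."""
    positive, negated = set(), set()
    for sent in spike_sentences:
        for phrase in candidate_phrases:
            if phrase in sent.lower():
                (negated if is_negated(sent, phrase) else positive).add(phrase)
    return {(p, q) for p in positive for q in negated if p != q}

sents = [
    "Snowden travels to Hong Kong, off southeastern China.",
    "Snowden cannot stay in Hong Kong as Chinese officials will not allow it.",
]
pairs = negative_phrase_pairs(sents, ["travels to", "stay in"])
```

On the Snowden example, "travels to" occurs positively while "stay in" is negated, so the pair is flagged as non-synonymous despite the two phrases heavily co-occurring.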