Temporal Scoping of Relational Facts Based on Wikipedia Data
Total Page:16
File Type:pdf, Size:1020Kb
Temporal Scoping of Relational Facts based on Wikipedia Data Avirup Sil ∗ Silviu Cucerzan Computer and Information Sciences Microsoft Research Temple University One Microsoft Way Philadelphia, PA 19122 Redmond, WA 98052 [email protected] [email protected] Abstract president-Of(George W. Bush, USA) holds within the time-frame (2001-2009) only. In this Most previous work in information paper, we focus on the relatively less explored extraction from text has focused on problem of attaching temporal scope to relation named-entity recognition, entity linking, between entities. The Text Analysis Conference and relation extraction. Less attention (TAC) introduced temporal slot filling (TSF) as has been paid given to extracting the one of the knowledge base population (KBP) tasks temporal scope for relations between in 2013 (Dang and Surdeanu, 2013). The in- named entities; for example, the relation put to a TAC-TSF system is a binary relation e.g. president-Of(John F. Kennedy, USA) per:spouse(Brad Pitt, Jennifer Aniston) and a is true only in the time-frame (January document assumed to contain supporting evidence 20, 1961 - November 22, 1963). In this for the relation. The required output is a 4-tuple paper we present a system for temporal timestamp [T1, T2, T3, T4], where T1 and T2 scoping of relational facts, which is are normalized dates that provide a range for the trained on distant supervision based on the start date of the relation, and T3 and T4 provide largest semi-structured resource available: the range for the end of the relationship. Sys- Wikipedia. The system employs language tems must also output the offsets of the text men- models consisting of patterns automat- tions that support the temporal information ex- ically bootstrapped from Wikipedia tracted. For example, from a text such as “Pitt sentences that contain the main entity of married Jennifer Aniston on July 29, 2000 [...] the a page and slot-fillers extracted from the couple divorced five years later in 2005.”, a sys- corresponding infoboxes. This proposed tem must extract the normalized timestamp [2000- system achieves state-of-the-art results 07-29, 2000-07-29, 2005-01-01, 2005-12-31], to- on 6 out of 7 relations on the benchmark gether with the entity and date offsets that support Text Analysis Conference 2013 dataset the timestamp. for temporal slot filling (TSF), and out- In this paper, we describe TSRF, a system for performs the next best system in the TAC temporal scoping of relational facts. For ev- 2013 evaluation by more than 10 points. ery relation type, TSRF uses distant supervision from Wikipedia infobox tuples to learn a language 1 Introduction model consisting of patterns of entity types, cate- Previous work on relation extraction (Agichtein gories, and word n-grams. Then it uses this trained and Gravano, 2000; Etzioni et al., 2004) by sys- relation-specific language model to extract the top tems such as NELL (Carlson et al., 2010), Know- k sentences that support the given relation between ItAll (Etzioni et al., 2004) and YAGO (Suchanek the query entity and the slot filler. In a second et al., 2007) have targeted the extraction of en- stage, TSRF performs timestamp classification by tity tuples, such as president-Of(George W. employing models which learn “Start”, “End” and Bush, USA), in order to build large knowl- “In” predictors of entities in a relationship; it com- edge bases of facts. These systems assume putes the best 4-tuple timestamp [T1, T2, T3, T4] that relational facts are time-invariant. However, based on the confidence values associated to the this assumption is not always true, for example top sentences extracted. Following the TAC-TSF SR ∗ This research was carried out during an internship at task for 2013, T F is trained and evaluated for Microsoft Research. seven relation types, as shown in Table 1. 109 Proceedings of the Eighteenth Conference on Computational Language Learning, pages 109–118, Baltimore, Maryland USA, June 26-27 2014. c 2014 Association for Computational Linguistics per:spouse Col.1: TEMP72211 Col.7: 1492 per:title Col.2: per:spouse Col.8: 1311 per:employee or member of Col.3: Brad Pitt Col.9: 1.0 Col.4: AFP ENG 20081208.0592 Col.10: E0566375 org:top employees/members Col.5: Jennifer Aniston Col.11: E0082980 per:cities of residence Col.6: 1098 per:statesorprovinces of residence per:countries of residence Table 2: Input to a TSF System. Table 1: Types of relations in the TAC-TSF. in extracting temporally related events. Sil et al. (2011a) automatically extract STRIPS represen- The remainder of the paper is organized as fol- tations (Fikes and Nilsson, 1971) from web text, lows: The next section describes related work. which are defined as states of the world before and Section 3 introduces the TAC-TSF input and out- after an event takes place. However, all these ef- put formats. Section 4 discusses the main chal- forts focus on temporal ordering of either events or lenges, and Section 5 details our method for tem- states of the world and do not extract timestamps poral scoping of relations. Section 6 describes our for events. By contrast, the proposed system ex- experiments and results, and it is followed by con- tracts temporal expressions and also produces an cluding remarks. ordering of the timestamps of relational facts be- tween entities. 2 Related Work The current state-of-the-art systems for TSF To our knowledge, there are only a small num- have been the RPI-Blender system by Artiles et ber of systems that have tackled the temporal al. (2011) and the UNED system by Garrido et scoping of relations task. YAGO (Wang et al., al. (2011; 2012). These systems obtained the 2010) extracts temporal facts using regular expres- top scores in the 2011 TAC TSF evaluation by sions from Wikipedia infoboxes, while PRAVDA outperforming the other participants such as the (Wang et al., 2011) uses a combination of textual Stanford Distant Supervision system (Surdeanu patterns and graph-based re-ranking techniques to et al., 2011). Similar to our work, these sys- extract facts and their temporal scopes simultane- tems use distant supervision to assign temporal la- ously. Both systems augment an existing KB with bels to relations extracted from text. While we temporal facts similarly to the CoTS system by employ Wikipedia infoboxes in conjunction with Talukdar et al. (2012a; 2012b). However, their Wikipedia text, the RPI-Blender and UNED sys- underlying techniques are not applicable to arbi- tems use tuples from structured repositories like trary text. In contrast, TSRF automatically boot- Freebase. There are major differences in terms of straps patterns to learn relation-specific language learning strategies of these systems: the UNED models, which can be used then for processing system uses a rich graph-based document-level any text. CoTS, a recent system that is part of representation to generate novel features whereas CMU’s NELL (Carlson et al., 2010) project, per- RPI-Blender uses an ensemble of classifiers com- forms temporal scoping of relational facts by using bining flat features based on surface text and de- manually edited temporal order constraints. While pendency paths with tree kernels. Our system em- manual ordering is appealing and can lead to high ploys language models based on Wikipedia that accuracy, it is impractical from a scalability per- are annotated automatically with entity tags in a spective. Moreover, the main goal of CoTS is to boosted-trees learning framework. A less impor- predict temporal ordering of relations rather than tant difference between TSRF and RPI-Blender is to scope temporally individual facts. Conversely, that the latter makes use of an additional tempo- our system automatically extracts text patterns, ral label (Start-And-End) for facts within a time and then uses them to perform temporal classi- range; TSRF employs Start, End, and In labels. fication based on gradient boosted decision trees (Friedman, 2001). 3 The Temporal Slot Filling Task The TempEval task (Pustejovsky and Verhagen, 2009) focused mainly on temporal event order- 3.1 Input ing. Systems such as (Chambers et al., 2007) and The input format for a TSF system as instantiated (Bethard and Martin, 2007) have been successful for the relation per:spouse(Brad Pitt, Jennifer 110 Aniston) is shown in Table 2. The field Column 1 ample repeated marriages between two persons. contains a unique query ID for the relation. Col- However, the most common situations for the re- umn 2 is the name of the relationship, which also lations covered in this task are captured correctly encodes the type of the target entity. Column 3 by this 4-tuple representation. contains the name of the query entity, i.e., the sub- ject of the relation. Column 4 contains a valid doc- 4 Challenges ument ID and Column 5 indicates the slot-filler en- tity. Columns 6 through 8 are offsets of the slot- We discuss here some of the main challenges en- filler, query entity and the relationship justification countered in building a temporal scoping system. in the given text. Column 9 contains a confidence score set to 1 to indicate that the relation is cor- 4.1 Lack of Annotated Data rect. Columns 10 and 11 contain the IDs in the KBP knowledge base of the entity and filler, re- Annotation of data for this task is expensive, as spectively. All of the above are provided by TAC. the human annotators must have extensive back- For the query in this example, a TSF system has to ground knowledge and need to analyze the evi- scope temporally the per:spouse relation be- dence in text and reliable knowledge resources.