SPINOZAVU: an NLP Pipeline for Cross Document Timelines

SPINOZA VU: An NLP Pipeline for Cross Document TimeLines Tommaso Caselli Antske Fokkens Roser Morante Piek Vossen Computational Lexicology & Terminology Lab (CLTL) VU Amsterdam, De Boelelaan 1105 1081 HV Amsterdam Nederland t.caselli antske.fokkens r.morantevallejo p.t.j.m.vossen @vu.nl { }{ }{ }{ } Abstract event detection and ordering; event coreference (in- document and cross-document); and entity-based This paper describes the system temporal processing. SPINOZA VU developed for the SemEval The SPINOZA VU system is based on the News- 2015 Task 4: Cross Document TimeLines. Reader (NWR) NLP pipeline (Agerri et al., 2013; The system integrates output from the News- Beloki et al., 2014), which has been developed Reader Natural Language Processing pipeline 1 and is designed following an entity based within the context of the NWR project and pro- model. The poor performance of the submit- vides multi-layer annotations over raw texts from ted runs are mainly a consequence of error tokenization up to temporal relations. The goal of propagation. Nevertheless, the error analysis the NWR project is to build structured event in- has shown that the interpretation module dexes from large volumes of news data addressing behind the system performs correctly. An the same research issues as the task. Within this out of competition version of the system has framework, we are developing a storyline module fixed some errors and obtained competitive results. Therefore, we consider the system an which aims at providing more structured represen- important step towards a more complex task tation of events and their relations. Timeline extrac- such as storyline extraction. tion from raw text qualifies as the first component of this new module. This is why we participated in Track A and Subtrack A of the task, timeline extrac- 1 Introduction tion from raw text. Participating in Track B would This paper reports on a system (SPINOZA VU) for require a full re-engineering of the NWR pipeline timeline extraction developed at the CLTL Lab of and of our system. the VU Amsterdam in the context of the SemEval The remainder of the paper is structured as fol- 2015 Task 4: Cross Document TimeLines. In this lows: Section 2 provides an overview of the model task, a timeline is defined as a set of chronologically implemented in the two versions of our system. Sec- anchored and ordered events extracted from a corpus tion 3 presents the results and error analysis, and spanning over a (large) period of time with respect Section 4 puts forward some conclusions. to a target entity. 2 From Model to System Cross-document timeline extraction benefits from previous works and evaluation campaigns in Tem- Timeline extraction involves a number of indepen- poral Processing, such as the TempEval evaluation dent though highly connected subtasks, the most campaigns (Verhagen et al., 2007; Verhagen et al., relevant ones being: entity resolution, event detec- 2010; UzZaman et al., 2013) and aims at promoting tion, event-participant linking, coreference resolu- research in temporal processing by tackling the following issues: cross-document and cross-temporal 1http://www.newsreader-project.eu 787 Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 787–791, Denver, Colorado, June 4-5, 2015. c 2015 Association for Computational Linguistics tion, factuality profiling, and temporal relation pro- • SPINOZA VU 1 uses the output of a state of cessing (ordering and anchoring). the art system, TIPSem (Llorens et al., 2010), We designed a system that addresses these sub- for event detection and temporal relations; tasks, first at document level, and then, at cross- • SPINOZA VU 2 is entirely based on data document level. We diverted from the general NWR from the NWR pipeline including the temporal approach and adopted an entity based model and (TLINKs) and causal relation (CLINKs) layers. representation rather than an event based one in order to fit the task. This means that we used entities The final output is based on a dedicated rule- as hub of information for timelines. Using an entity based module, the TimeLine (TML) module. We driven representation allows us to better model the will describe in the following paragraphs how each following aspects: subtask has been tackled with respect to each version of the system. • Event co-participation: the data collected Entity identification Entity identification relies with this method facilitates the analysis of the on the entity detection and disambiguation layer interactions between the participants involved (NERD) of the NWR pipeline. Each detected en- in an event individually; tity is associated with a URI (a unique identifier), • Event relations: in an entity based representa- either from DBpedia or a specifically created one tion, event mentions with more than one entity based on the strings describing the entity. We ex- as their participants will be repeated in the final tracted the entities by merging information from the representation (both at in-document at cross- NERD layer with that from the semantic role la- document levels); such a representation can be belling (SRL) layer. We retained only those en- further used to explore and discover additional tity mentions which fulfil the argument positions event relations2; of proto-agent (Arg0) or proto-patient (Arg1) in the • Event coreference: we assume that two event SRL layer. mentions (either in the same document or in Event detection and classification The different documents) are coreferential if they SPINOZA VU 1 event module is based on share the same participant set (i.e., entities) and TIPSem, which provides TimeML compliant data. occur at the same time and place (Chen et al., We developed post processing rules to convert the 2011; Cybulska and Vossen, 2013); TimeML event classes (OCCURRENCE, STATE, • Temporal relations: temporal relation pro- I ACTION, I STATE, ASPECTUAL, REPORT- cessing can benefit from an entity driven ap- ING and PERCEPTION) to specific FrameNet proach as sequences of events sharing the same frames (e.g., Communication, Being in operation, entities (i.e., co-participant events) can be as- Body movement) and/or Event Situation Ontology sumed to stand in precedence relation (Cham- (ESO) types (Segers et al., 2015) (e.g., contextual), bers and Jurafsky, 2009; Chambers and Juraf- which correspond to the event types specified in the sky, 2010). task guidelines. For instance, not all mentions of TimeML I STATE, I ACTION, OCCURRENCE 2.1 The SPINOZA VU System and STATE events can enter a timeline. The The NWR pipeline which forms the basis of the alignment with FrameNet and ESO is made by SPINOZA VU system consists of 15 modules car- combining the data from the Word Sense Dis- rying out various NLP tasks and outputs the results ambiguation (WSD) layer of the pipeline with in NLP Annotation Format (Fokkens et al., 2014), Predicate Matrix (version 1.1) (Lacalle et al., 2014). a layered standoff representation format. Two ver- As for the SPINOZA VU 2, we have used the sions of the system have been developed, namely: NWR SRL layer to identify and retain the eligi- 2 ble events. In this case the access to the Predi- We are referring to a broader set of relations that we la- cate Matrix is not necessary as each predicate in beled as “bridging relations” among events which involve co- participation, primary and secondary causal relations, temporal the SRL layer is also associated with corresponding relations, and entailment relations. FrameNet frames and ESO types. Only the pred- 788 icates matching specific FrameNet frames and/or TimeLine Extraction The TimeLine Extrac- ESO types were retained as candidate events. tion (TML) module3 harmonizes and orders cross- Factuality The factuality filter consists of a col- document temporal relations (anchoring and order- lection of rules in order to determine whether an ing). It provides a method for selecting the initial event is within the scope of a factuality marker (relevant) temporal relations (namely, all anchoring negating an event or indicating that it is uncertain, in relations) and enhance an updating mechanism of in- which case the event is excluded from the set of el- formation so that additional temporal relations (both igible events. Factuality markers are different types anchoring and ordering relations) can be inferred. of modality and negation cues (adverbs, adjectives, Timelines are first created at a document level and prepositions, modal auxiliaries, pronouns and deter- subsequently merged. The cross-document timeline miners). For instance, if a verb has a dependency model is event-based and aims at building a global relation of type AM-MOD with a modal auxiliary is timeline between all events and temporal expres- excluded from the candidate event in the timeline. sions regardless of the target entities. This approach Coreference relations Two levels of corefer- allows us to also make use of temporal information ence need to be addressed: in-document and cross- provided by events that are not part of the final time- document. As for the former, both versions of the lines. Finally, the target entities for the timelines are system rely on the coreference layer (COREF layer) extracted using two strategies: i) a perfect match be- of the pipeline. Concerning the cross-document tween the target entities and the DBpedia URIs, and level, two strategies have been implemented: ii) the Levenshtein distance (Levenshtein, 1966) between the target entities and the URIs. For this latter • Cross-document entity mentions are identified strategy, an empirical threshold was set to maximize using the URI links associated with entity men- precision on the basis of the trial data. tions; all entity mentions from different documents sharing the same URIs are associated 3 Results and Error Analysis with the same entity instance; • Cross-document event coreference is obtained In Table 1 we report the results of both versions of during a post-processing step of the timeline the system for Track A - Main.

Load more