KOI at SemEval-2018 Task 5: Building Knowledge Graph of Incidents
Paramita Mirza (1), Fariz Darari (2)*, Rahmad Mahendra (2)*
(1) Max Planck Institute for Informatics, Germany
(2) Faculty of Computer Science, Universitas Indonesia, Indonesia
paramita@mpi-inf.mpg.de, {fariz,rahmad.mahendra}@cs.ui.ac.id

* Both authors contributed equally.

Abstract

We present KOI (Knowledge of Incidents), a system that, given news articles as input, builds a knowledge graph (KOI-KG) of incidental events. KOI-KG can then be used to efficiently answer questions such as “How many killing incidents happened in 2017 that involve Sean?” The required steps in building the KG include: (i) document preprocessing, involving word sense disambiguation, named-entity recognition, temporal expression recognition and normalization, and semantic role labeling; (ii) incidental event extraction and coreference resolution via document clustering; and (iii) KG construction and population.

1 Introduction

SemEval-2018 (http://alt.qcri.org/semeval2018/) Task 5: Counting Events and Participants in the Long Tail (Postma et al., 2018) addresses the problem of referential quantification, which requires a system to answer numerical questions about events such as (i) “How many killing incidents happened in June 2016 in San Antonio, Texas?” or (ii) “How many people were killed in June 2016 in San Antonio, Texas?”

Subtasks S1 and S2  For questions of type (i), which are asked by the first two subtasks, participating systems must be able to identify the type (e.g., killing, injuring), time, location and participants of each event occurring in a given news article, and to establish within- and cross-document event coreference links. Subtask S1 focuses on evaluating systems’ performance in identifying answer incidents, i.e., events whose properties fit the constraints of the questions, by making sure that there is only one answer incident per question.

Subtask S3  In order to answer questions of type (ii), participating systems are also required to identify participant roles in each identified answer incident (e.g., victim, subject-suspect), and to use such information, along with victim-related numerals (“three people were killed”) mentioned in the corresponding answer documents, i.e., documents that report on the answer incident, to determine the total number of victims.

Datasets  The organizers released two datasets: (i) test data, stemming from the three domains of gun violence, fire disasters and business, and (ii) trial data, covering only the gun violence domain. Each dataset contains (i) an input document (in CoNLL format) that comprises news articles, and (ii) a set of questions (in JSON format) to evaluate the participating systems (https://competitions.codalab.org/competitions/17285).

This paper describes the KOI (Knowledge of Incidents) system submitted to SemEval-2018 Task 5. KOI constructs and populates a knowledge graph of incidental events mentioned in news articles, which is then used to retrieve answer incidents and answer documents given numerical questions about events. We propose a fully unsupervised approach to identify events and their properties in news texts, and to resolve within- and cross-document event coreference, as detailed in the following section.

2 System Description

2.1 Document Preprocessing

Given an input document in CoNLL format (one token per line), for each news article we first split the sentences following the annotation of: (i) whether a token is part of the article title or content; (ii) the sentence identifier; and (iii) whether a token is a newline character. We then ran several tools on the tokenized sentences to obtain the following NLP annotations.

Word sense disambiguation (WSD)  We ran Babelfy (Moro et al., 2014; http://babelfy.org/) to obtain disambiguated concepts (excluding stop words), which can be multi-word expressions, e.g., gunshot wound. Each concept is linked to a sense in BabelNet (Navigli and Ponzetto, 2012; http://babelnet.org/), which in turn is also linked to a WordNet sense and a DBpedia entity (if any).

Named-entity recognition (NER)  We relied on spaCy (https://spacy.io/) for statistical entity recognition, specifically for identifying persons and geopolitical entities (countries, cities, and states); a minimal sketch of this step is given at the end of this subsection.

Time expression recognition and normalization  We used HeidelTime (Strötgen and Gertz, 2013; https://github.com/HeidelTime/heideltime) to recognize textual spans that indicate time, e.g., this Monday, and to normalize the time expressions with respect to a given document creation time, e.g., 2018-03-05.

Semantic role labeling (SRL)  Senna (Collobert et al., 2011; https://ronan.collobert.com/senna/) was used to run semantic parsing on the input text, identifying sentence-level events (i.e., predicates) and their participants.
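To make the NER step above concrete, the following minimal sketch (our illustration, not the authors’ code) extracts the person and geopolitical entities that KOI relies on, assuming spaCy’s standard English model en_core_web_sm:

    import spacy

    # Load a statistical English pipeline with an NER component
    # (en_core_web_sm is an assumed, commonly used model name).
    nlp = spacy.load("en_core_web_sm")

    def extract_entities(sentence):
        """Return the persons (PERSON) and geopolitical entities (GPE)
        recognized in a sentence."""
        doc = nlp(sentence)
        return [(ent.text, ent.label_)
                for ent in doc.ents
                if ent.label_ in ("PERSON", "GPE")]

    # e.g. [('Randall Coffland', 'PERSON'), ('San Antonio', 'GPE'), ('Texas', 'GPE')]
    print(extract_entities("Randall Coffland was shot in San Antonio, Texas."))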
2.2 Event Extraction and Coreference Resolution

Identifying document-level events  Sentence-level events, i.e., predicates recognized by the SRL tool, were considered as candidates for document-level events. Note that predicates containing other predicates as their patient argument, e.g., ‘says’ with arguments ‘police’ as its agent and ‘one man was shot to death’ as its patient, were not considered as candidate events.

Given a predicate, we simultaneously determined whether it is part of a document-level event and identified its type, based on the occurrence of BabelNet concepts related to the four event types of interest stated in the task guidelines: killing, injuring, fire burning and job firing. A predicate is automatically labeled as a sentence-level event with one of the four types if such related concepts occur either in the predicate itself or in one of its arguments. For example, a predicate ‘shot’, with arguments ‘one man’ as its patient and ‘to death’ as its manner, will be considered a killing event because of the occurrence of the ‘death’ concept. (We assume that a predicate labeled as a killing event cannot also be labeled as an injuring event, even though an injuring-related concept such as ‘shot’ occurs.)

Concept relatedness was computed via path-based WordNet similarity (Hirst et al., 1998) between a given BabelNet concept, which is linked to a WordNet sense, and a predefined set of related WordNet senses for each event type (e.g., wn30:killing.n.02 and wn30:kill.v.01 for the killing event), with 5.0 as the relatedness threshold; a sketch of this step is given at the end of this subsection. Related concepts were also annotated with the corresponding event types, to be used for the mention-level event coreference evaluation.

We then regarded all identified sentence-level events of the same event type in a news article as one document-level event, meaning that each article may contain at most four document-level events (i.e., at most one event per event type).

Identifying document-level event participants  Given a predicate identified as an event, its participants were extracted from the occurrences of named entities of type person, according to both Senna and spaCy, in the agent and patient arguments of the predicate. Furthermore, we determined the role of each participant as victim, perpetrator or other, based on how it is mentioned with respect to the predicate. For example, if ‘Randall’ is mentioned as the agent argument of the predicate ‘shot’, then he is a perpetrator. Note that a participant can have multiple roles, as is the case for a person who kills himself.

Taking into account all participants of the identified events (per event type) in a news article, we extracted document-level event participants by resolving name coreference. For instance, ‘Randall’, ‘Randall R. Coffland’, and ‘Randall Coffland’ all refer to the same person.

Identifying document-level number of victims  For each identified predicate in a given document, we extracted the first numeral occurring in the patient argument of the predicate, e.g., one in ‘one man’. The normalized value of the numeral was then taken as the number of victims, provided that the predicate is not a suspect-related predicate such as suspected or charged. The number of victims of a document-level event is then simply the maximum of the victim counts identified per predicate, as sketched below.
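The following sketch illustrates the event-typing step with NLTK’s WordNet interface. It is our reconstruction under two stated substitutions: NLTK does not ship the Hirst et al. (1998) measure used by the authors (whose scale explains the 5.0 threshold), so plain path similarity with an illustrative threshold stands in for it, and only the killing seeds from the paper’s example are listed:

    from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

    # Seed senses for the killing type, as given in the paper
    # (wn30:killing.n.02, wn30:kill.v.01); seeds for injuring,
    # fire burning and job firing would be defined analogously.
    EVENT_SEEDS = {
        "killing": [wn.synset("killing.n.02"), wn.synset("kill.v.01")],
    }

    # Illustrative threshold for path similarity, which lies in (0, 1];
    # the paper's 5.0 applies to the Hirst et al. measure instead.
    THRESHOLD = 0.4

    def label_event_type(concept_lemmas):
        """Label a predicate with an event type if any concept occurring
        in the predicate or its arguments is related to the type's seeds."""
        for event_type, seeds in EVENT_SEEDS.items():
            for lemma in concept_lemmas:
                for synset in wn.synsets(lemma):
                    for seed in seeds:
                        if (synset.path_similarity(seed) or 0.0) >= THRESHOLD:
                            return event_type
        return None

    # The paper's example: 'shot' with patient 'one man' and manner
    # 'to death' ends up labeled as a killing event.
    print(label_event_type(["shoot", "man", "death"]))  # -> killing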
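The victim-counting heuristic can be sketched as follows; the numeral-normalization table and the suspect-related predicate list are illustrative stand-ins, since the paper does not enumerate them:

    # Illustrative numeral normalization; the paper does not specify
    # which normalization resource was used.
    WORD2NUM = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

    # Predicates whose numerals count suspects rather than victims
    # (the two examples given in the paper).
    SUSPECT_PREDICATES = {"suspected", "charged"}

    def first_numeral(patient_argument):
        """Normalized value of the first numeral in a patient argument,
        e.g. 'one man' -> 1, or None if no numeral occurs."""
        for token in patient_argument.lower().split():
            if token.isdigit():
                return int(token)
            if token in WORD2NUM:
                return WORD2NUM[token]
        return None

    def num_of_victims(predicates):
        """Document-level victim count: the maximum per-predicate value,
        skipping suspect-related predicates. Takes (predicate, patient
        argument) pairs."""
        values = [first_numeral(patient)
                  for predicate, patient in predicates
                  if predicate not in SUSPECT_PREDICATES]
        return max((v for v in values if v is not None), default=None)

    print(num_of_victims([("shot", "one man"),
                          ("killed", "three people"),
                          ("charged", "two suspects")]))  # -> 3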
Identifying document-level event locations  To retrieve candidate event locations for a given document, we relied on the disambiguated DBpedia entities resulting from the Babelfy annotation. We used SPARQL queries over the DBpedia SPARQL endpoint to identify whether a DBpedia entity is a city or a state, and whether it is part of or located in a city or a state (see the SPARQL sketch at the end of this section). Specifically, an entity is considered to be a city whenever it is of type dbo:City or one of its equivalent types (e.g., schema:City). Similarly, it is considered to be a state whenever it is of type yago:WikicatStatesOfTheUnitedStates, has a senator (via the property dbp:senators), or has dbc:States_of_the_United_States as a subject. We further assumed that the document-level events identified in a given news article happen at one certain location.

We resolved cross-document coreference between document-level events of the same type via their provenance, i.e., the news articles in which they were mentioned. From each news article we derived TF-IDF-based vectors of (i) BabelNet senses and (ii) spaCy’s person and geopolitical entities, which were then used to compute cosine similarities among the articles (also sketched below). Two news articles are clustered together if (i) the computed similarity is above a certain threshold, which was optimized on the trial data, and (ii) the event time distance between the document-level events found in the articles does not exceed a certain threshold, i.e., 3 days.

Resource Type   Properties
IncidentEvent   eventType, eventDate, location, participant, numOfVictims
Document        docDate, docID, event
Participant     fullname, firstname, lastname, role
Location        city, state
Date            value, day, month, year

Table 1: KOI-KG ontology
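The city and state tests against DBpedia can be sketched with SPARQLWrapper as follows; the ASK queries are our reconstruction of the checks described above, not the authors’ exact queries:

    from SPARQLWrapper import SPARQLWrapper, JSON

    DBPEDIA = SPARQLWrapper("https://dbpedia.org/sparql")

    def ask(query):
        """Run a SPARQL ASK query against the DBpedia endpoint."""
        DBPEDIA.setQuery(query)
        DBPEDIA.setReturnFormat(JSON)
        return DBPEDIA.query().convert()["boolean"]

    def is_city(entity):
        # dbo:City membership; equivalent types such as schema:City
        # would be tested analogously.
        return ask(f"ASK {{ <{entity}> a <http://dbpedia.org/ontology/City> }}")

    def is_state(entity):
        # A state is of the YAGO class of U.S. states, has a senator,
        # or has dbc:States_of_the_United_States as a subject.
        return ask(f"""
            ASK {{
              {{ <{entity}> a <http://dbpedia.org/class/yago/WikicatStatesOfTheUnitedStates> }}
              UNION {{ <{entity}> <http://dbpedia.org/property/senators> ?senator }}
              UNION {{ <{entity}> <http://purl.org/dc/terms/subject>
                       <http://dbpedia.org/resource/Category:States_of_the_United_States> }}
            }}""")

    print(is_city("http://dbpedia.org/resource/San_Antonio"))  # expected: True
    print(is_state("http://dbpedia.org/resource/Texas"))       # expected: True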
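The document-clustering criterion for cross-document event coreference can be sketched with scikit-learn, treating each article’s BabelNet sense IDs and entity strings as one pseudo-document; the similarity threshold below is a placeholder for the value the authors tuned on the trial data:

    from itertools import combinations
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    SIM_THRESHOLD = 0.5   # placeholder; the actual value was tuned on trial data
    MAX_DAY_DISTANCE = 3  # maximum event time distance, from the paper

    def coreferent_pairs(articles):
        """Yield index pairs of articles whose document-level events are
        clustered together. Each article carries a space-joined 'features'
        string (BabelNet sense IDs plus person/GPE entities) and an
        ordinal 'event_day' for its document-level event time."""
        tfidf = TfidfVectorizer().fit_transform(a["features"] for a in articles)
        sims = cosine_similarity(tfidf)
        for i, j in combinations(range(len(articles)), 2):
            close_in_time = (abs(articles[i]["event_day"]
                                 - articles[j]["event_day"]) <= MAX_DAY_DISTANCE)
            if sims[i, j] > SIM_THRESHOLD and close_in_time:
                yield i, j

    articles = [  # feature strings and event days are made up for illustration
        {"features": "bn:00024922n shooting Randall_Coffland San_Antonio",
         "event_day": 736481},
        {"features": "bn:00024922n shooting Randall_Coffland San_Antonio Texas",
         "event_day": 736482},
        {"features": "bn:00034256n fire warehouse Oakland", "event_day": 736600},
    ]
    print(list(coreferent_pairs(articles)))  # -> [(0, 1)]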
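Finally, populating a graph that follows the Table 1 ontology can be sketched with rdflib; the namespace URI and resource naming scheme are assumptions, since the paper does not specify KOI-KG’s URIs:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    # Assumed namespace; the paper does not give KOI-KG's URI scheme.
    KOI = Namespace("http://example.org/koi-kg/")

    g = Graph()
    g.bind("koi", KOI)

    event = KOI["event/killing_1"]
    victim = KOI["participant/randall_coffland"]

    # An IncidentEvent with some of the properties listed in Table 1.
    g.add((event, RDF.type, KOI.IncidentEvent))
    g.add((event, KOI.eventType, Literal("killing")))
    g.add((event, KOI.numOfVictims, Literal(1, datatype=XSD.integer)))
    g.add((event, KOI.participant, victim))

    # A Participant with its name and role properties.
    g.add((victim, RDF.type, KOI.Participant))
    g.add((victim, KOI.fullname, Literal("Randall R. Coffland")))
    g.add((victim, KOI.role, Literal("victim")))

    print(g.serialize(format="turtle"))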