Building a Corpus for Japanese Wikification with Fine-Grained Entity Classes
Total Page:16
File Type:pdf, Size:1020Kb
Building a Corpus for Japanese Wikification with Fine-Grained Entity Classes Davaajav Jargalsaikhan Naoaki Okazaki Koji Matsuda Kentaro Inui Tohoku University, Japan davaajav, okazaki, matsuda, inui @ecei.toholu.ac.jp { } Abstract EL is useful for various NLP tasks, e.g., Question-Answering (Khalid et al., 2008), Infor- In this research, we build a Wikifica- mation Retrieval (Blanco et al., 2015), Knowl- tion corpus for advancing Japanese En- edge Base Population (Dredze et al., 2010), Co- tity Linking. This corpus consists of 340 Reference Resolution (Hajishirzi et al., 2013). Japanese newspaper articles with 25,675 There are about a dozen of datasets target- entity mentions. All entity mentions are ing EL in English, including UIUC datasets labeled by a fine-grained semantic classes (ACE, MSNBC) (Ratinov et al., 2011), AIDA (200 classes), and 19,121 mentions were datasets (Hoffart et al., 2011), and TAC-KBP successfully linked to Japanese Wikipedia datasets (2009–2012 datasets) (McNamee and articles. Even with the fine-grained se- Dang, 2009). mantic classes, we found it hard to define Ling et al. (2015) discussed various challenges the target of entity linking annotations and in EL. They argued that the existing datasets are to utilize the fine-grained semantic classes inconsistent with each other. For instance, TAC- to improve the accuracy of entity linking. KBP targets only mentions belonging to PER- 1 Introduction SON, LOCATION, ORGANIZATION classes. Although these entity classes may be dominant in Entity linking (EL) recognizes mentions in a text articles, other tasks may require information on and associates them to their corresponding en- natural phenomena, product names, and institu- tries in a knowledge base (KB), for example, tion names. In contrast, the MSNBC corpus does Wikipedia1, Freebase (Bollacker et al., 2008), and not limit entity classes, linking mentions to any DBPedia (Lehmann et al., 2015). In particu- Wikipedia article. However, the MSNBC corpus lar, when linked to Wikipedia articles, the task is does not have a NIL label even if a mention be- called Wikifiation (Mihalcea and Csomai, 2007). longs to an important class such as PERSON or Let us consider the following sentence. LOCATION, unlike the TAC-KBP corpus. On the 2nd of June, the team of Japan There are few studies addressing on Japanese will play World Cup (W Cup) qualifica- EL. Furukawa et al. (2014) conducted a study on tion match against Honduras in the sec- recognizing technical terms appearing in academic ond round of Kirin Cup at Kobe Wing articles and linking them to English Wikipedia Stadium, the venue for the World Cup. articles. Hayashi et al. (2014) proposed an EL Wikification is expected to link “Soccer” to the method that simultaneously performs both English Wikipedia article titled Soccer, “World Cup” and and Japanese Wikification, given parallel texts in “W Cup”2 to FIFA World Cup 2002, “team of both languages. Nakamura et al. (2015) links Japan” to Japan Football Team, “Kobe City” to keywords in social media into English Wikipedia, Kobe, “Kobe Wing Stadium” to Misaki Park Sta- aiming at a cross-language system that recognizes dium. Since there is no entry for “Second Round topics of social media written in any language. Os- of Kirin Cup”, the mention is labeled as NIL. ada et al. (2015) proposed a method to link men- tions in news articles for organizing local news of 1https://www.wikipedia.org/ 2“W Cup” is a Japanese-style abbreviation of “World different prefectures in Japan. Cup”. However, these studies do not necessarily ad- 138 Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics – Student Research Workshop, pages 138–144, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics vance EL on a Japanese KB. As of January 2016, is limited to mentions that belong to PERSON, Japanese Wikipedia and English Wikipedia in- LOCATION or ORGANIZATION classes only. clude about 1 million and 5 million, respectively, When an article is not present in Wikipedia, UIUC articles. However, there are only around 0.56 mil- does not record this information in any way. On lion inter-language links between Japanese and the contrary, TAC-KBP5 and our datasets have NIL English. Since most of the existing KBs (e.g., tag used to mark a mention when it does not have Freebase and DBPedia) originate from Wikipedia, an entry in KB. we cannot expect that English KBs cover enti- ties that are specific to Japanese culture, locals, 2.1 Design Policy and economics. Moreover, a Japanese EL system Ling et al. (2015) argued that the task definition is useful for populating English knowledge base of EL itself is challenging: whether to target only as well, harvesting source documents written in named entities (NEs) or to include general nouns; Japanese. whether to limit semantic classes of target NEs; To make matters worse, we do not have a cor- how to define NE boundaries; how specific the pus for Japanese EL, i.e., Japanese mentions as- links should be; and how to handle metonymy. sociated with Japanese KB. Although (Murawaki The original (Hashimoto et al., 2008) corpus is and Mori, 2016) concern with Japanese EL, the also faced with similar challenges: mention abbre- corpus they have built is not necessarily a cor- viations that result in the string representation that pus for Japanese EL. The motivation behind their is an exact match to the string representation of work comes from the difficulty of word segmen- another mention, abbreviated or not (for example, tation for unsegmented languages, like Chinese or “Tokyo (City)” and “TV Tokyo”), metonymy and Japanese. (Murawaki and Mori, 2016) approach synecdoche. the word segmentation problem from point of view As for the mention “World Cup” in the exam- of Wikification. Their focus is on the word seg- ple in Section 1, we have three possible candidates mentation rather than on the linking. entities, World Cup, FIFA World Cup, and 2002 In this research, we build a Japanese Wikifica- FIFA World Cup. Although all of them look rea- tion corpus in which mentions in Japanese doc- sonable, 2002 FIFA World Cup is the most suit- uments are associated with Japanese Wikipedia able, being more specific than others. At the same articles. The corpus consists of 340 newspaper time, we cannot expect that Wikipedia includes the articles from Balanced Corpus of Contemporary most specific entities. For example, let us suppose 3 Written Japanese (BCCWJ) annotated with fine- that we have a text discussing a possible venue grained named entity labels defined by Sekine’s for 2034 FIFA World Cup. As of January 2016, Extended Named Entity Hierarchy (Sekine et al., Wikipedia does not include an article about 2034 4 2002) . FIFA World Cup6. Thus, it may be a difficult deci- sion whether to link it to FIFA World Cup or make 2 Dataset Construction it NIL. To give a better understanding of our dataset we Moreover, the mention “Kobe Wing Stadium” briefly compare it with existing English datasets. includes nested NE mentions, “Kobe (City)” and The most comparable ones are UIUC (Ratinov “Kobe Wing Stadium”. Furthermore, although the et al., 2011) and TAC-KBP 2009–2012 datasets article titled “Kobe Wing Stadium” does exist in (McNamee and Dang, 2009). Although, AIDA Japanese Wikipedia, the article does not explain datasets are widely used for Disambiguation of the stadium itself but explains the company run- Entities, AIDA uses YAGO, an unique Knowl- ning the stadium. Japanese Wikipedia includes edge Base derived from Wikipedia, GeoNames a separate article Misaki Park Stadium describing and Wordnet, which makes it difficult to com- the stadium. In addition, the mention “Honduras” pare. UIUC is similar to our dataset in a sense that does not refer to Honduras as a country, but as the it links to any Wikipedia article without any se- national soccer team of Honduras. mantic class restrictions, unlike TAC-KBP which In order to separate these issues raised by NEs 3http://pj.ninjal.ac.jp/corpus_center/ 5TAC-KBP 2012 requires NIL to be clustered in accor- bccwj/en/ dance to the semantic classes. 4https://sites.google.com/site/ 6Surprisingly, Wikipedia includes articles for the future extendednamedentityhierarchy/ World Cups up to 2030. 139 from the EL task, we decided to build a Wikifica- Attribute Value tion corpus on top of a portion of BCCWJ cor- # articles 340 pora with Extended Named Entity labels anno- # mentions 25,675 tated (Hashimoto et al., 2008). This corpus con- # links 19,121 sists of 340 newspaper articles where NE bound- # NILs 6,554 aries and semantic classes are annotated. This # distinct mentions 7,118 design strategy has some advantages. First, we # distinct entities 6,008 can omit the discussion on semantic classes and boundaries of NEs. Second, we can analyze the Table 1: Statistics of the corpus built by this work. impact of semantic classes of NEs to the task of EL. Annotator pair Agreement 2.2 Annotation Procedure Annotators 1 and 2 0.910 We have used brat rapid annotation tool (Stene- Annotators 2 and 3 0.924 torp et al., 2012) to effectively link mentions to Wikipedia articles. Brat has a functionality of im- Table 2: Inter-annotator agreement. porting external KBs (e.g., Freebase or Wikipedia) for EL. We have prepared a KB for Brat using mention. a snapshot of Japanese Wikipedia accessed on November 2015. We associate a mention to a 2.3 Annotation Results Wikipedia ID so that we can uniquely locate an ar- Table 1 reports the statistics of the corpus built by ticle even when the title of the article is changed. this work. Out of 25,675 mentions satisfying the We configure Brat so that it can present a title and conditions explained in Section 2.2, 19,121 men- a lead sentence (short description) of each article tions were linked to Japanese Wikipedia articles.