Cross-lingual Wikification Using Multilingual Embeddings

Chen-Tse Tsai and Dan Roth
University of Illinois at Urbana-Champaign
201 N. Goodwin, Urbana, Illinois, 61801
{ctsai12, danr}@illinois.edu

Abstract

Cross-lingual Wikification is the task of grounding mentions written in non-English documents to entries in the English Wikipedia. This task involves the problem of comparing textual clues across languages, which requires developing a notion of similarity between text snippets across languages. In this paper, we address this problem by jointly training multilingual embeddings for words and Wikipedia titles. The proposed method can be applied to all languages represented in Wikipedia, including those for which no machine translation technology is available. We create a challenging dataset in 12 languages and show that our proposed approach outperforms various baselines. Moreover, our model compares favorably with the best systems on the TAC KBP2015 Entity Linking task, including those that relied on the availability of translation from the target language to English.

1 Introduction

Wikipedia has become an indispensable resource in knowledge acquisition and text understanding for both human beings and computers. The task of Wikification or Entity Linking aims at disambiguating mentions (sub-strings) in text to the corresponding titles (entries) in Wikipedia or other Knowledge Bases, such as FreeBase. For English text, this problem has been studied extensively (Bunescu and Pasca, 2006; Cucerzan, 2007; Mihalcea and Csomai, 2007; Ratinov et al., 2011; Cheng and Roth, 2013). It has also been shown to be a valuable component of several natural language processing and information extraction tasks across different domains.

Recently, there has also been interest in the cross-lingual setting of Wikification: given a mention from a document written in a foreign language, the goal is to find the corresponding title in the English Wikipedia. This task is driven partly by the fact that a lot of information around the world may be written in a foreign language for which there are limited linguistic resources and, specifically, no English translation technology. Instead of translating the whole document to English, grounding the important entity mentions in the English Wikipedia may be a good solution that could better capture the key message of the text, especially if it can be reliably achieved with fewer resources than those needed to develop a translation system. This task is mainly driven by the Text Analysis Conference (TAC) Knowledge Base Population (KBP) Entity Linking Tracks (Ji et al., 2012; Ji et al., 2015; Ji et al., 2016), where the target languages are Spanish and Chinese. In this paper, we develop a general technique which can be applied to all languages in Wikipedia, even when no machine translation technology is available for them.

The challenges in Wikification are due both to ambiguity and variability in expressing entities and concepts: a given mention in text, e.g., Chicago, may refer to different titles in Wikipedia (Chicago Bulls, the city, Chicago Bears, the band, etc.), and a title can be expressed in the text in multiple ways, such as synonyms and nicknames. These challenges are usually resolved by calculating some similarity between the representation of the mention and candidate titles. For instance, the mention could be represented using its neighboring words, whereas a title is usually represented by the words and entities in the document which introduces the title.


In the cross-lingual setting, an additional challenge arises from the need to match words in a foreign language to an English title.

In this paper, we address this problem by using multilingual title and word embeddings. We represent words and Wikipedia titles in both the foreign language and in English in the same continuous vector space, which allows us to compute meaningful similarity between mentions in the foreign language and titles in English. We show that learning these embeddings only requires Wikipedia documents and language links between the titles across different languages, which are quite common in Wikipedia. Therefore, we can learn embeddings for all languages in Wikipedia without any additional annotation or supervision.

Another notable challenge for the cross-lingual setting that we do not address in this paper is that of generating English candidate titles given a foreign mention when there is no corresponding title in the foreign language Wikipedia. If a title exists in both the English and the foreign language Wikipedia, there could be examples of using this title in the foreign language Wikipedia text, and this information could help us determine the possible English titles. For example, Vladimir N. Vapnik exists in both the English Wikipedia (en/Vladimir Vapnik) and the Chinese Wikipedia. In the Chinese Wikipedia, we may see the corresponding Chinese mention used in context, and such usage could help us determine the possible English titles.

For evaluation purposes, we focus in this paper on mentions that have corresponding titles in both the English and the foreign language Wikipedia, and concentrate on disambiguating titles across languages. This allows us to evaluate on a large number of Wikipedia documents. Note that under this setting, a natural approach is to do wikification in the foreign language and then follow the language links to obtain the corresponding English titles. However, this approach requires developing a separate wikifier for each foreign language if it uses language-specific features, while our approach is generic and only requires using the appropriate embeddings. Importantly, the aforementioned approach will also not generalize to the cases where the target titles only exist in the English Wikipedia, while ours does.

We create a challenging Wikipedia dataset for 12 foreign languages and show that the proposed approach, WikiME (Wikification using Multilingual Embeddings), consistently outperforms various baselines. Moreover, the results on the TAC KBP2015 Entity Linking dataset show that our approach compares favorably with the best Spanish system and the best Chinese system despite using significantly weaker resources (no need for translation). We note that the need for translation would have prevented the wikification of the 12 languages used in this paper.

2 Task Definition and Model Overview

We formalize the problem as follows. We are given a set of mentions in a document written in a foreign language; for each mention, the goal is to find the corresponding title in the English Wikipedia, or to output NIL if no appropriate English title exists.

We represent the mention using various contextual clues and compute several similarity scores between the mention and the English title candidates based on multilingual word and title embeddings. A ranking model learnt from Wikipedia documents is used to combine these similarity scores and output the final score for each title candidate. We then select the candidate with the highest score as the answer, or output NIL if there is no appropriate candidate.

The rest of the paper is structured as follows. Section 3 introduces our approach of generating multilingual word and title embeddings for all languages in Wikipedia. Section 4 presents the proposed cross-lingual wikification model, which is based on multilingual embeddings. Evaluations and analyses are presented in Section 5. Section 6 discusses related work. Finally, Section 7 concludes the paper.

3 Multilingual Entity and Word Embeddings

In this section, we describe how we generate a vector representation for each word and Wikipedia title in any language.

3.1 Monolingual Embeddings

The first step is to train monolingual embeddings for each language separately. We adopt the "Alignment by Wikipedia Anchors" model proposed in Wang et al. (2014). For each language, we take all documents in Wikipedia and replace the hyperlinked text with the corresponding Wikipedia title. For example, consider the following Wikipedia sentence: "It is led by and mainly composed of Sunni Arabs from Iraq and Syria.", where the three bold-faced mentions are linked to some Wikipedia titles. We replace those mentions and the sentence becomes "It is led by and mainly composed of en/Sunni Islam Arabs from en/Iraq and en/Syria." We then learn the skip-gram model (Mikolov et al., 2013a; Mikolov et al., 2013b) on this newly generated text. Since a title appears as a token in the transformed text, we will obtain an embedding for each word and title from the model.

The skip-gram model maximizes the following objective:

$$\sum_{(w,c)\in D} \log\frac{1}{1+e^{-v'_c\cdot v_w}} \;+\; \sum_{(w,c)\in D'} \log\frac{1}{1+e^{\,v'_c\cdot v_w}},$$

where $w$ is the target token (word or title), $c$ is a context token within a window of $w$, $v_w$ is the target embedding representing $w$, $v'_c$ is the embedding of $c$ in context, $D$ is the set of training documents, and $D'$ contains the sampled token pairs which serve as negative examples. This objective is maximized with respect to the variables $v_w$ and $v'_w$. In this model, tokens in the context are used to predict the target token. The token pairs in the training documents are positive examples, and the randomly sampled pairs are negative examples.
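To make the anchor-replacement preprocessing and the skip-gram training concrete, here is a minimal sketch; it is not the authors' released code. It assumes gensim 4.x and a hypothetical `wiki_sentences` iterator in which each hyperlink has already been parsed into a (surface_tokens, title) pair.

```python
# Sketch of Section 3.1: replace anchor text with Wikipedia titles, then train
# a skip-gram model so that words and titles live in one monolingual space.
# Assumes gensim 4.x; `wiki_sentences` is a hypothetical iterator yielding lists
# whose items are either plain word strings or (surface_tokens, title) tuples,
# e.g. (["Sunni", "Arabs"], "en/Sunni Islam").
from gensim.models import Word2Vec

def anchor_replaced_corpus(wiki_sentences):
    """Yield token lists in which each hyperlinked span is replaced by its title."""
    for sentence in wiki_sentences:
        tokens = []
        for item in sentence:
            if isinstance(item, tuple):      # a hyperlink: (surface_tokens, title)
                _, title = item
                tokens.append(title)         # the title becomes a single token
            else:
                tokens.append(item)          # an ordinary word
        yield tokens

def train_monolingual_embeddings(wiki_sentences):
    corpus = list(anchor_replaced_corpus(wiki_sentences))
    # Skip-gram (sg=1) with negative sampling; dimensionality 500 as in the paper.
    model = Word2Vec(corpus, vector_size=500, sg=1, negative=10,
                     window=5, min_count=5, workers=4)
    return model.wv                          # embeddings for both words and titles
```

Because titles are ordinary tokens after the replacement, the same lookup table serves word and title embeddings in the steps below.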
3.2 Multilingual Embeddings

After getting monolingual embeddings, we adopt the model proposed in Faruqui and Dyer (2014) to project the embeddings of a foreign language and English into the same space. The requirement of this model is a dictionary which maps words in English to words in the foreign language. Note that there is no need to have this mapping for every word. The aligned words are used to learn the projection matrices, and the matrices can later be applied to the embedding of each word to obtain the enhanced new embeddings. Faruqui and Dyer (2014) obtain this dictionary by picking the most frequent translated word from a parallel corpus. However, there is a limited or no parallel corpus for many languages. Since our monolingual embedding model also contains title embeddings, we can use the Wikipedia title alignments between two languages as the dictionary.

Let $A_{en} \in \mathbb{R}^{a \times k_1}$ and $A_{fo} \in \mathbb{R}^{a \times k_2}$ be the matrices containing the embeddings of the aligned English and foreign language titles, where $a$ is the number of aligned titles and $k_1$ and $k_2$ are the dimensionalities of the English embeddings and the foreign language embeddings, respectively (i.e., each row is a title embedding). Canonical correlation analysis (CCA) (Hotelling, 1936) is applied to these two matrices:

$$P_{en}, P_{fo} = \mathrm{CCA}(A_{en}, A_{fo}),$$

where $P_{en} \in \mathbb{R}^{k_1 \times d}$ and $P_{fo} \in \mathbb{R}^{k_2 \times d}$ are the projection matrices for the English and foreign language embeddings, and $d$ is the dimensionality of the projected vectors, which is a parameter in CCA.

FEATURE TYPE      DESCRIPTION
Basic             Pr(c|m) and Pr(m|c): the fraction of times the title candidate c is the target page given the mention m, and the fraction of times c is referred to by m
Other Mentions    Cosine similarity of e(c) and the average of the vectors in other-mentions(m); the maximum and minimum cosine similarity of the vectors in other-mentions(m) and e(c)
Local Context     Cosine similarity of e(c) and context_j(m), for j = 30, 100, and 200
Previous Titles   Cosine similarity of e(c) and the average of the vectors in previous-titles(m); the maximum and minimum cosine similarity of the vectors in previous-titles(m) and e(c)

Table 1: Features for measuring the similarity of an English title candidate c and a mention m in the foreign language, where e(c) is the English title embedding of c. other-mentions(m), previous-titles(m), and context_j(m) are defined in Section 4.2.

Let $E_{en} \in \mathbb{R}^{n_1 \times k_1}$ be the matrix containing the monolingual embeddings of all words and titles in English, where the number of words and titles is $n_1$. We obtain the multilingual embeddings of English words and titles by

$$E'_{en} = E_{en} P_{en} \in \mathbb{R}^{n_1 \times d}.$$

Similarly, the multilingual embeddings of the foreign words and titles are stored in the rows of

$$E'_{fo} = E_{fo} P_{fo} \in \mathbb{R}^{n_2 \times d},$$

where there are $n_2$ words and titles in the foreign language. The rows of $E'_{en}$ and $E'_{fo}$ are the representations of words and titles that we use to create the similarity features in the ranker.

Faruqui and Dyer (2014) show that the multilingual embeddings perform better than monolingual embeddings on various English word similarity datasets. Since synonyms in English may be translated into the same word in a foreign language, the CCA model could bring the synonyms in English closer in the embedding space. In this paper, we further show that projecting the embeddings of the two languages into the same space helps us compute better similarity between words and titles across languages, and that a bilingual dictionary consisting of pairs of Wikipedia titles is sufficient to induce these embeddings.
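As a rough illustration of this projection step, the sketch below uses scikit-learn's CCA as a stand-in for the Faruqui and Dyer (2014) implementation used in the paper; the function name, the `title_alignments` input, and the use of gensim KeyedVectors are assumptions for the example.

```python
# Sketch of Section 3.2: learn CCA projections from aligned Wikipedia titles and
# apply them to all words and titles of each language. scikit-learn's CCA is a
# stand-in for the Faruqui & Dyer (2014) code (which exposes a "ratio" parameter);
# here we simply fix the projected dimensionality d.
import numpy as np
from sklearn.cross_decomposition import CCA

def project_with_cca(en_vectors, fo_vectors, title_alignments, d=250):
    """en_vectors / fo_vectors: gensim KeyedVectors holding word *and* title vectors.
    title_alignments: list of (english_title, foreign_title) pairs taken from
    Wikipedia inter-language links (the bilingual "dictionary")."""
    pairs = [(e, f) for e, f in title_alignments
             if e in en_vectors and f in fo_vectors]
    A_en = np.array([en_vectors[e] for e, _ in pairs])    # a x k1
    A_fo = np.array([fo_vectors[f] for _, f in pairs])    # a x k2

    cca = CCA(n_components=d)
    cca.fit(A_en, A_fo)                                    # learns P_en, P_fo

    E_en = en_vectors.vectors                              # n1 x k1, rows follow index_to_key
    E_fo = fo_vectors.vectors                              # n2 x k2

    # transform(X) projects the English side; to project the foreign side we pass
    # a dummy English block of matching shape and keep only the Y scores.
    E_en_proj = cca.transform(E_en)                        # rows of E'_en, n1 x d
    _, E_fo_proj = cca.transform(np.zeros((E_fo.shape[0], E_en.shape[1])), E_fo)
    return E_en_proj, E_fo_proj                            # rows of E'_en and E'_fo
```

The decoupling is visible here: the monolingual embeddings are trained once, and only the light-weight projection needs to be re-learned when a better title alignment becomes available.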

4 Cross-lingual Wikification

We now describe the algorithm for finding the English title given a foreign mention.

4.1 Candidate Generation

Given a mention m, the first step is to select a set of English title candidates C_m, a subset of all titles in the English Wikipedia. Ideally, the correct title is included in this set. The goal is to produce a manageable number of candidates so that a more sophisticated algorithm can be applied to disambiguate them.

Since we focus on the titles in the intersection of the English and the foreign language Wikipedia, we can build indices from the anchor texts in the foreign language Wikipedia. More specifically, we create two dictionaries and apply a two-step approach. The first dictionary maps each hyperlinked mention string in the text to the corresponding English titles. We simply look up this dictionary using the query mention m to retrieve all possible titles. The title candidates are initially sorted by Pr(title|mention), the fraction of times the title is the target page of the given mention. This probability is estimated from all Wikipedia documents. The top k title candidates are then returned.

If the first, high-precision dictionary fails to generate any candidate, we then look up the second dictionary. We break each hyperlinked mention string into tokens, and create a dictionary which maps tokens to English titles. The tokens of m are used to query this dictionary. Similarly, the candidates are sorted by Pr(title|token) and the top k candidates are returned.
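A rough sketch of this two-dictionary lookup follows, under the assumption that the anchor statistics have already been collected from a foreign-language Wikipedia dump into nested counters; the names below are illustrative, not the authors' code.

```python
# Sketch of Section 4.1: two-step candidate generation from anchor statistics.
# `mention_to_titles` and `token_to_titles` are assumed to map a surface string
# (or token) to a Counter of English titles, collected from foreign-language
# anchors whose targets can be mapped to English via inter-language links.
from collections import Counter

def generate_candidates(mention, mention_to_titles, token_to_titles, k=10):
    """Return up to k (english_title, score) candidates for a foreign mention."""
    # Step 1: exact surface lookup, ranked by Pr(title | mention).
    counts = mention_to_titles.get(mention)
    if counts:
        total = sum(counts.values())
        ranked = sorted(counts.items(), key=lambda x: x[1], reverse=True)
        return [(title, c / total) for title, c in ranked[:k]]

    # Step 2: back off to token-level lookup, ranked by Pr(title | token).
    merged = Counter()
    for token in mention.split():
        counts = token_to_titles.get(token)
        if not counts:
            continue
        total = sum(counts.values())
        for title, c in counts.items():
            merged[title] += c / total
    return merged.most_common(k)
```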

4.2 Candidate Ranking

Given a mention m and a set of title candidates C_m, we compute a score for each title in C_m which indicates how relevant the title is to m.

For a candidate $c \in C_m$, we define the relevance as

$$s(m, c) = \sum_i w_i\, \phi_i(m, c), \qquad (1)$$

a weighted sum of the features $\phi_i$, which are based on multilingual title and word embeddings. We represent the mention m by the following contextual clues and use these representations to compute feature values:

• context_j(m): use the tokens within j characters of m to compute the TF-IDF weighted average of their embeddings in the foreign language.

• other-mentions(m): a set of vectors that represent the other mentions. For each mention in the document other than m, we represent it by averaging the embeddings of the tokens in the mention surface string.

• previous-titles(m): a set of vectors that represent previously disambiguated entities. For each mention before m, we represent it by the English embedding of its disambiguated title.

Let e(c) be the English embedding of the title candidate c. The features used in Eq. (1) are shown in Table 1. We train a linear ranking SVM model with the proposed features to obtain the weights w_i in Eq. (1). Finally, the title which has the highest relevance score is chosen as the answer to m.
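To make the feature computation concrete, here is a hedged sketch of the embedding-based similarity features and the final weighted score of Eq. (1); the data structures and feature ordering are illustrative assumptions, the Previous Titles features are analogous to Other Mentions and omitted for brevity, and the weight vector would come from the ranking SVM.

```python
# Sketch of Section 4.2: embedding-based similarity features combined by a
# learned weight vector, s(m, c) = sum_i w_i * phi_i(m, c).
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0

def tfidf_context_vector(context_tokens, fo_vectors, idf):
    """TF-IDF weighted average of foreign-language token embeddings: context_j(m)."""
    vecs, weights = [], []
    for tok in context_tokens:
        if tok in fo_vectors:
            vecs.append(fo_vectors[tok])
            weights.append(idf.get(tok, 1.0))
    if not vecs:
        return np.zeros(fo_vectors.vector_size)
    return np.average(np.array(vecs), axis=0, weights=weights)

def candidate_features(e_c, p_title_given_mention, context_vecs, other_mention_vecs):
    """phi(m, c): Basic, Local Context, and Other Mentions features."""
    feats = [p_title_given_mention]                       # Basic
    feats += [cosine(e_c, ctx) for ctx in context_vecs]   # context_j(m), j = 30, 100, 200
    if other_mention_vecs:
        sims = [cosine(e_c, v) for v in other_mention_vecs]
        avg = np.mean(np.array(other_mention_vecs), axis=0)
        feats += [cosine(e_c, avg), max(sims), min(sims)]
    else:
        feats += [0.0, 0.0, 0.0]
    return np.array(feats)

def score_candidate(weights, e_c, p_title_given_mention, context_vecs, other_mention_vecs):
    """s(m, c): weighted sum of features; `weights` are learned by the ranking SVM."""
    phi = candidate_features(e_c, p_title_given_mention, context_vecs, other_mention_vecs)
    return float(np.dot(weights, phi))
```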
5 Experiments

We evaluate the proposed method on the Wikipedia dataset of 12 languages and on the TAC'15 Entity Linking dataset.

For all experiments, we use the Word2Vec implementation in Gensim (https://radimrehurek.com/gensim/) to learn the skip-gram model with dimensionality 500 for each language. The CCA code for projecting monolingual embeddings is from Faruqui and Dyer (2014) (https://github.com/mfaruqui/crosslingual-cca), in which the ratio parameter is set to 0.5 (i.e., the resulting multilingual embeddings have dimensionality 250). We use the Stanford Word Segmenter (Chang et al., 2008) for tokenizing Chinese, and use the Java built-in BreakIterator for Thai. For all other languages, tokenization is based on whitespace. The number of tokens we use to learn the skip-gram model and the number of title alignments used by the CCA are given in Table 2. For learning the weights in Eq. (1), we use the implementation of linear ranking SVM in Lee and Lin (2014). Parameter selection and feature engineering are done by conducting cross-validation on the training data of each dataset.

LANGUAGE   #TOKENS        #ALIGN. TITLES
German     616,347,668    960,624
Spanish    460,984,251    754,740
French     357,553,957    1,088,660
Italian    342,038,537    836,154
Chinese    179,637,674    469,982
Hebrew      75,076,391    137,821
Thai        68,991,911     72,072
Arabic      67,954,771    255,935
Turkish     47,712,534    162,677
Tamil       12,665,312     50,570
Tagalog      4,925,785     48,725
Urdu         3,802,679     83,665

Table 2: The number of tokens used in training the skip-gram model and the number of titles which can be aligned to the corresponding English titles via the language links in Wikipedia.

5.1 Wikipedia Dataset

We create this dataset from the documents in Wikipedia by taking the anchors (hyperlinked texts) as the query mentions and the corresponding English Wikipedia titles as the answers. Note that we only keep the mentions for which we can get the corresponding English Wikipedia titles via the language links. As observed in previous work (Ratinov et al., 2011), most of the mentions in Wikipedia documents are easy; that is, the baseline of simply choosing the title that maximizes Pr(title|mention), the most frequent title given the mention surface string, performs quite well. In order to create a more challenging dataset, we randomly select mentions such that the number of easy mentions is about twice the number of hard mentions (those mentions for which the most common title is not the correct title). This generation process is inspired by (and close to) the distribution generated in the TAC KBP2015 Entity Linking Track. Another problem that occurs when creating a dataset from Wikipedia documents is that even though training documents are different from test documents, many mentions and titles actually overlap.

To test that the algorithms really generalize from training examples, we ensure that no (mention, title) pair in the test set appears in the training set. Table 3 shows the number of training mentions, test mentions, and hard mentions in the test set of each language. This dataset is publicly available at http://bilbo.cs.illinois.edu/~ctsai12/xlwikifier-wikidata.zip.

LANGUAGE   #TRAINING   #TEST (#HARD)
German     23,124      9,798 (3,266)
Spanish    30,471      12,153 (4,051)
French     37,860      14,358 (4,786)
Italian    34,185      12,775 (4,254)
Chinese    44,246      11,394 (3,798)
Hebrew     20,223      16,146 (5,382)
Thai       16,819      11,381 (3,792)
Arabic     22,711      10,646 (3,549)
Turkish    12,942      13,798 (4,598)
Tamil      21,373      11,346 (3,776)
Tagalog     4,835       1,074 (358)
Urdu        1,413       1,389 (463)

Table 3: The number of training and test mentions of the Wikipedia dataset. The mentions are from the hyperlinked text in randomly selected Wikipedia documents. We ensure that at least one-third of the test mentions are hard (cannot be solved by the most common title given the mention).

LANGUAGE   METHOD       HARD    EASY    TOTAL
German     MonoEmb      35.18   96.92   76.34
           WordAlign    52.39   95.32   81.01
           WikiME       53.28   95.53   81.45
           Ceiling      90.20   100     96.73
Spanish    EsWikifier   40.11   99.28   79.56
           MonoEmb      38.46   96.12   76.90
           WordAlign    48.75   95.78   80.10
           WikiME       54.46   94.83   81.37
           Ceiling      93.46   100     97.69
French     MonoEmb      23.17   97.16   72.50
           WordAlign    41.70   96.08   77.96
           WikiME       47.51   95.72   79.65
           Ceiling      89.41   100     96.47
Italian    MonoEmb      32.68   97.48   75.90
           WikiME       48.28   95.52   79.79
           Ceiling      87.99   100     96.00
Chinese    MonoEmb      43.73   97.85   79.81
           WikiME       57.61   98.03   84.55
           Ceiling      94.29   100     98.10
Hebrew     MonoEmb      42.59   98.16   79.64
           WikiME       56.67   97.71   84.03
           Ceiling      96.84   100     98.95
Thai       MonoEmb      53.43   99.08   83.87
           WikiME       70.02   99.17   89.46
           Ceiling      94.49   100     98.16
Arabic     MonoEmb      39.81   98.99   79.26
           WikiME       62.05   98.17   86.13
           Ceiling      93.27   100     97.76
Turkish    MonoEmb      40.47   98.15   78.93
           WikiME       60.18   97.55   85.10
           Ceiling      94.08   100     98.03
Tamil      MonoEmb      34.51   98.65   77.30
           WikiME       54.13   99.13   84.15
           Ceiling      95.60   100     98.54
Tagalog    MonoEmb      35.47   99.44   78.12
           WikiME       56.70   98.46   84.54
           Ceiling      90.78   100     96.93
Urdu       MonoEmb      63.71   98.81   87.11
           WikiME       74.51   99.35   91.07
           Ceiling      90.06   100     96.69

Table 4: Ranking performance (Precision@1) of different approaches on various languages. Since about one-third of the test mentions are non-trivial, a baseline is 66.67 for all languages if we pick the most common title given the mention. Bold signifies the highest score for each column.

The performance of the proposed method (WikiME) is shown in Table 4 along with the following approaches:

MonoEmb: In this method, we use the monolingual embeddings before applying CCA, while all the other settings are the same as in WikiME. Since the monolingual embeddings are learnt separately for each language, calculating the cosine similarity of a word embedding in the foreign language and an English title embedding does not produce a good similarity function. The ranker, though, learns that the most important feature is Pr(title|mention), and, consequently, performs well on easy mentions but has poor performance on hard mentions.

WordAlign: Instead of using the aligned Wikipedia titles in generating multilingual embeddings, the CCA model operates on the word alignments as originally proposed in Faruqui and Dyer (2014). We use the word alignments provided by Faruqui and Dyer (2014), which are obtained from the parallel news commentary corpora combined with the Europarl corpus for English to German, French, and Spanish. The numbers of aligned words for German, French, and Spanish are 37,484, 37,582, and 37,554, respectively. WikiME performs statistically significantly better than WordAlign on all three languages.

EsWikifier: We use the Illinois Wikifier (Ratinov et al., 2011; Cheng and Roth, 2013) on a Spanish Wikipedia dump and train its ranker on the same set of documents that are used in WikiME.

Ceiling: These rows show the performance of title candidate generation. That is, the numbers indicate the percentage of mentions that have the gold title in their candidate set, and therefore upper-bound the ranking performance.

In sum, WikiME can disambiguate the hard mentions much better than the other methods without sacrificing much performance on the easy mentions. Comparing across different languages, it is important to note that languages which have a smaller Wikipedia tend to have better performance, despite the degradation in the quality of the embeddings (see below). This is due to the difficulty of the datasets. That is, there is less ambiguity because the number of articles in the corresponding Wikipedia is small.

Figure 1: Feature ablation study of WikiME. The left bar of each language shows the performance on hard mentions, whereas the right bar corresponds to the performance on all mentions. The descriptions of the feature types are listed in Table 1.

Figure 1 shows the feature ablation study of WikiME. For each language, we show results on hard mentions (the left bar) and all mentions (the right bar). We do not show the performance on easy mentions since it always stays high and does not change much. We can see that Local Context and Other Mentions are very effective for most of the languages. In particular, on hard mentions, the performance gain of the three feature groups is from almost 0 to around 50. For easier datasets such as Urdu, Basic features alone work quite well.

Figure 2: The number of aligned titles used in generating multilingual embeddings versus the performance of WikiME.

Figure 2 shows the performance of WikiME when we vary the number of aligned titles used in generating multilingual embeddings. The performance drops a lot when there are only a few aligned titles, especially for Spanish and French, where the results are even worse than MonoEmb when only 2,000 titles are aligned. This indicates that the CCA method needs enough aligned pairs in order to produce good embeddings. The performance does not change much when there are more than 16,000 aligned titles.

5.2 TAC KBP2015 Entity Linking

To evaluate our system on documents outside Wikipedia, we conduct an experiment on the evaluation documents of the TAC KBP2015 Tri-Lingual Entity Linking Track. In this dataset, there are 166 Chinese documents (84 news and 82 discussion forum articles) and 167 Spanish documents (84 news and 83 discussion forum articles). The mentions in this dataset are all named entities of five types: Person, Geo-political Entity, Organization, Location, and Facility.

APPROACH                    SPANISH   CHINESE
Translation + EnWikifier    79.35     N/A
EsWikifier                  79.04     N/A
WikiME                      82.43     85.07
+Typing:
Top TAC'15 System           80.4      83.1
WikiME                      80.93     83.63

Table 5: TAC KBP2015 Entity Linking dataset. All results use gold mentions and the metric is precision@1. The top section only evaluates the linked FreeBase ID. To compare with the best systems in TAC, we also classify each mention into the five entity types. The results which evaluate both FreeBase IDs and entity types are shown in the bottom section.

Table 5 shows the results. Besides the Spanish Wikifier (EsWikifier) that we used in the previous experiment, we implemented another baseline for Spanish Wikification. In this method, we use Google Translate to translate the whole documents from Spanish to English, and then the English Illinois Wikifier is applied to disambiguate the English gold mentions. Note that the target Knowledge Base of this dataset is FreeBase; therefore, we use the FreeBase API to map the resulting English or Spanish Wikipedia titles to the corresponding FreeBase ID. If this conversion fails to find the corresponding FreeBase ID, "NIL" is returned instead.

The ranker models used in all three systems are trained on Wikipedia documents. We can see that WikiME outperforms both baselines significantly on Spanish. It is interesting to see that the translation-based baseline performs slightly better than the Spanish Wikifier, which indicates that the machine translation between Spanish and English is quite reliable. Note that this translation-based baseline got the highest score in this shared task when the mention boundaries were not given.

The row "Top TAC'15 System" lists the best scores of the diagnostic setting in which mention boundaries are given (Ji et al., 2016). Since the official evaluation metric considers not only the linked FreeBase IDs but also the entity types, namely, an answer is counted as correct only if the FreeBase ID and the entity type are both correct, we built two simple 5-class classifiers to classify each mention into the five entity types so that we can compare with the state of the art. One classifier uses FreeBase types of the linked FreeBase ID as features, and this classifier is only applied to mentions that are linked to some entry in FreeBase. For NIL mentions, another classifier which uses word-form features (words in the mention, previous word, and next word) is applied. Both classifiers are trained on the training data of this task. From the last two rows of Table 5, we can see that WikiME achieves better results than the best TAC participants.
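As a rough illustration of the second (NIL) classifier, the following is a hedged sketch of a 5-class word-form classifier using scikit-learn; the feature names, label set abbreviations, and training interface are assumptions for the example, not the authors' implementation.

```python
# Sketch of the word-form entity-type classifier applied to NIL mentions:
# a 5-class linear model over simple lexical features (mention words,
# previous word, next word). Names and the example format are illustrative.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

ENTITY_TYPES = ["PER", "GPE", "ORG", "LOC", "FAC"]   # the five TAC types

def word_form_features(mention_tokens, prev_word, next_word):
    feats = {"prev=" + prev_word: 1.0, "next=" + next_word: 1.0}
    for tok in mention_tokens:
        feats["word=" + tok] = 1.0
    return feats

def train_type_classifier(examples):
    """examples: list of ((mention_tokens, prev_word, next_word), entity_type)."""
    X = [word_form_features(*x) for x, _ in examples]
    y = [label for _, label in examples]
    clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X, y)
    return clf
```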
6 Related Work

Wikification on English documents has been studied extensively. Earlier works (Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007) focus on local features which compare context words with the content of candidate Wikipedia pages. Later, several works (Cucerzan, 2007; Milne and Witten, 2008; Han and Zhao, 2009; Ferragina and Scaiella, 2010; Ratinov et al., 2011) proposed to explore global features, trying to capture coherence among the titles that appear in the text. In our method, we compute local and global features based on multilingual embeddings, which allow us to capture better similarity between words and Wikipedia titles across languages.

The annual TAC KBP Entity Linking Track has used the cross-lingual setting since 2011 (Ji et al., 2012; Ji et al., 2015; Ji et al., 2016), where the target foreign languages are Spanish and Chinese. To the best of our knowledge, most of the participants use one of the following two approaches: (1) do entity linking in the foreign language, and then find the corresponding English titles from the resulting foreign language titles; or (2) translate the query documents to English and do English entity linking. The first approach relies on a large enough Knowledge Base in the foreign language, whereas the second depends on a good machine translation system. The approach developed in this paper makes significantly simpler assumptions on the availability of such resources, and therefore can scale also to lower-resource languages, while doing very well also on high-resource languages.

Wang et al. (2015) proposed an unsupervised method which matches a knowledge graph with a graph constructed from mentions and the corresponding candidates of the query document.

This approach performs well on the Chinese dataset of TAC'13, but falls into category (1). Moro et al. (2014) proposed another graph-based approach which uses Wikipedia and WordNet in multiple languages as lexical resources. However, they only focus on English Wikification.

McNamee et al. (2011) aim at the same cross-lingual Wikification setting as we do, where the challenge is in comparing foreign language words with English titles. They treat this problem as a cross-lingual information retrieval problem. That is, given the context words of the target mention in the foreign language, retrieve the most relevant English Wikipedia page. However, their approach requires parallel text to estimate word translation probabilities. In contrast, our method only needs Wikipedia documents and the inter-language links.

Besides the CCA-based multilingual word embeddings (Faruqui and Dyer, 2014) that we extend in Section 3, several other methods also try to embed words in different languages into the same space. Hermann and Blunsom (2014) use a sentence-aligned corpus to learn bilingual word vectors. The intuition behind the model is that representations of aligned sentences should be similar. Unlike the CCA-based method, which learns monolingual word embeddings first, this model directly learns the cross-lingual embeddings. Luong et al. (2015) propose Bilingual Skip-Gram, which extends the monolingual skip-gram model and learns bilingual embeddings using a parallel corpus and word alignments. The model jointly considers within-language co-occurrence and meaning equivalence across languages. That is, the monolingual objective for each language is also included in their learning objective. Several recent approaches (Gouws et al., 2014; Coulmance et al., 2015; Shi et al., 2015; Soyer et al., 2015) also require a sentence-aligned parallel corpus to learn multilingual embeddings. Unlike other approaches, Vulić and Moens (2015) propose a model that only requires comparable corpora in two languages to induce cross-lingual vectors. Similar to our proposed approach, this model can also be applied to all languages in Wikipedia if we treat documents across two Wikipedia languages as a comparable corpus. However, we believe the quality and quantity of this comparable corpus will be low for low-resource languages.

We choose the CCA-based model because we can obtain multilingual word and title embeddings for all languages in Wikipedia without any additional data beyond Wikipedia. In addition, by decoupling the training of the monolingual embeddings from the cross-lingual alignment, we make it easier to improve the quality of the embeddings by getting more text in the target language or a better dictionary between English and the target language. Nevertheless, as cross-lingual wikification provides another testbed for multilingual embeddings, it would be very interesting to compare these recent models on Wikipedia languages.

7 Conclusion

We propose a new, low-resource approach to Wikification across multiple languages. Our first step is to train multilingual word and title embeddings jointly, using title alignments across Wikipedia collections in different languages. We then show that, using features based on these multilingual embeddings, our wikification ranking model performs very well on a newly constructed dataset in 12 languages, and achieves state of the art also on the TAC'15 Entity Linking dataset.

An immediate future direction following our work is to improve the title candidate generation process so that it can handle the case where the corresponding titles only exist in the English Wikipedia. This only requires augmenting our method with a transliteration tool and, together with the proposed disambiguation approach across languages, this will be a very useful tool for low-resource languages which have a small number of articles in Wikipedia.

Acknowledgments

This research is supported by NIH grant U54-GM114838, a grant from the Allen Institute for Artificial Intelligence (allenai.org), and Contract HR0011-15-2-0025 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

References

R. Bunescu and M. Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the European Chapter of the ACL (EACL).

P.-C. Chang, M. Galley, and C. D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics.

X. Cheng and D. Roth. 2013. Relational inference for wikification. In Proceedings of the Conference on Empirical Methods for Natural Language Processing (EMNLP).

J. Coulmance, J.-M. Marty, G. Wenzek, and A. Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings. In Proceedings of EMNLP.

S. Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference of EMNLP-CoNLL, pages 708–716.

M. Faruqui and C. Dyer. 2014. Improving vector space word representations using multilingual correlation. Association for Computational Linguistics.

P. Ferragina and U. Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1625–1628. ACM.

S. Gouws, Y. Bengio, and G. Corrado. 2014. BilBOWA: Fast bilingual distributed representations without word alignments. In Deep Learning Workshop, NIPS.

X. Han and J. Zhao. 2009. Named entity disambiguation by leveraging Wikipedia semantic knowledge. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 215–224. ACM.

K. M. Hermann and P. Blunsom. 2014. Multilingual distributed representations without word alignment. In Proceedings of ICLR.

H. Hotelling. 1936. Relations between two sets of variates. Biometrika, pages 321–377.

H. Ji, R. Grishman, and H. T. Dang. 2012. Overview of the TAC2011 knowledge base population track. In Text Analysis Conference (TAC2011).

H. Ji, J. Nothman, and B. Hachey. 2015. Overview of TAC-KBP2014 entity discovery and linking tasks. In Text Analysis Conference (TAC2014).

H. Ji, J. Nothman, B. Hachey, and R. Florian. 2016. Overview of TAC-KBP2015 tri-lingual entity discovery and linking. In Text Analysis Conference (TAC2015).

C.-P. Lee and C.-J. Lin. 2014. Large-scale linear RankSVM. Neural Computation, 26(4):781–817.

T. Luong, H. Pham, and C. D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.

P. McNamee, J. Mayfield, D. Lawrie, D. W. Oard, and D. S. Doermann. 2011. Cross-language entity linking. In Proceedings of IJCNLP, pages 255–263.

R. Mihalcea and A. Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pages 233–242.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

D. Milne and I. H. Witten. 2008. Learning to link with Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pages 509–518.

A. Moro, A. Raganato, and R. Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, volume 2, pages 231–244.

L. Ratinov, D. Downey, M. Anderson, and D. Roth. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

T. Shi, Z. Liu, Y. Liu, and M. Sun. 2015. Learning cross-lingual word embeddings via matrix co-factorization. In Proceedings of ACL.

H. Soyer, P. Stenetorp, and A. Aizawa. 2015. Leveraging monolingual data for crosslingual compositional word representations. In Proceedings of ICLR.

I. Vulić and M.-F. Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of ACL.

Z. Wang, J. Zhang, J. Feng, and Z. Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of EMNLP.

H. Wang, J. G. Zheng, X. Ma, P. Fox, and H. Ji. 2015. Language and domain independent entity linking with quantified collective validation. In Proceedings of EMNLP.
