Cross Lingual Entity Linking with Bilingual Topic Model
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Cross Lingual Entity Linking with Bilingual Topic Model

Tao Zhang, Kang Liu and Jun Zhao
Institute of Automation, Chinese Academy of Sciences
HaiDian District, Beijing, China
{tzhang, kliu, jzhao}@nlpr.ia.ac.cn

Abstract

Cross-lingual entity linking means linking an entity mention in a background source document in one language with the corresponding real-world entity in a knowledge base written in another language. The key problem is to measure the similarity score between the context of the entity mention and the document of the candidate entity. This paper presents a general framework for cross-lingual entity linking that leverages a large-scale bilingual knowledge base, Wikipedia. We introduce a bilingual topic model that mines bilingual topics from this knowledge base under the assumption that documents describing the same Wikipedia concept in two different languages share the same semantic topic distribution. Each extracted topic has two types of representation, one for each language. Thus both the context of the entity mention and the document of the candidate entity can be represented in a single space spanned by the same semantic topics. We use these topics to perform cross-lingual entity linking. Experimental results show that the proposed approach obtains competitive results compared with the state-of-the-art approach.

1 Introduction

The web is growing rapidly, as reflected by the number of Internet users and the amount of content on the Internet. It has become a central goal for search engine companies to maximize user satisfaction. Moreover, since a user's query usually expresses an information need about a named entity, it would be very useful if a general search engine could automatically discover information about named entities. The emergence of large-scale knowledge bases such as Wikipedia and DBpedia has facilitated this goal. These knowledge bases are usually written in English. However, according to a report of August 2008¹, there are 90 billion web pages on the Internet and 31% of them are written in other languages. This report tells us that the percentage of web pages written in English is decreasing, which indicates that English pages are becoming less dominant than before. Therefore, it becomes more and more important to maintain the growing knowledge base by discovering and extracting information from web content written in different languages. Furthermore, inserting newly extracted knowledge derived from an information extraction system into a knowledge base inevitably requires a system that maps entity mentions in one language to their entries in a knowledge base that may be in another language. This task is known as cross-lingual entity linking.

Intuitively, to resolve the cross-lingual entity linking problem, the key is how to define the similarity score between the context of an entity mention and the document of a candidate entity. Once the similarity score is defined, the system can select the entity with the highest score as the answer in an unsupervised way, or use it as a feature in a supervised framework. However, the context of the entity mention and the document associated with the entity are described in different languages. Thus the traditional BOW model, which is based on exact matching of terms between the query and the candidate entity, is not suitable for the cross-lingual entity linking task.

To overcome this problem, many methods have been proposed. Previous studies [McNamee et al., 2011] usually use a statistical machine translation tool to first translate the original document contexts into English equivalents, transforming the cross-lingual problem into a monolingual entity linking problem. However, the performance of such an entity linking system depends heavily on the performance of the statistical machine translation system. To train the statistical machine translation system, one must obtain parallel corpora in which texts in one language and their translations in the other language are well aligned at the word or sentence level. Such corpora are usually edited by human editors in specific domains, and obtaining them is expensive. Therefore, the machine-translation-based method is not easy to adapt to other languages.

Other researchers have developed an explicit semantic model for this problem [Fahrni et al., 2011]. They regard entity linking as an information retrieval problem and believe that cross-lingual entity linking is an extreme case of the vocabulary mismatch problem. To overcome this, they map texts onto the given Wikipedia concepts using the ESA method [Gabrilovich and Markovitch, 2007] and perform entity linking in the space of Wikipedia concepts. A multilingual knowledge base is needed for this method. However, the knowledge bases in different languages differ greatly in size. For example, the English version of Wikipedia in 2009 contained 3,000,000 articles, while the Chinese version contained only 300,000 articles, far fewer than the English version. As a result, a large percentage of English Wikipedia concepts have no corresponding Chinese concept, which degrades the performance of the method.

Inspired by the idea of using external data sources mentioned above, we present a general framework for building a cross-lingual entity linking system with hidden topics discovered from large-scale parallel corpora. Wikipedia, which was launched in 2001, has become a very large multilingual data resource. In Wikipedia, each article describes a concept, and each concept is usually described in several languages. The articles written in different languages but describing the same concept are closely related in their hidden topics. It is therefore interesting to use topic modeling algorithms to mine cross-lingual topics from Wikipedia. Topic modeling has been widely used in various text applications [Phan et al., 2008; Titov and McDonald, 2008]. A typical topic modeling method is LDA [Blei et al., 2003], which can be seen as a probabilistic approach to latent topic computation. Each topic is represented by a distribution over words (a probabilistic language model), and each word has a probability score that measures its contribution to the topic.

In this paper, we propose a novel bilingual topic model that mines bilingual topics from Wikipedia data in two different languages. A topic is inherently bilingual: it has two types of representation, each corresponding to one language. Versions of the same Wikipedia concept described in different languages are constrained to share one identical topic distribution. Based on this framework, we represent the query and the candidate entity as distributions over semantic topics. The similarity of a query and a candidate entity is computed as the cosine similarity between their vectors in the semantic space. This is illustrated in Figure 1. Experiments on the standard cross-lingual entity linking dataset show that our approach is promising. Unlike previous research work, our approach does not require additional linguistic resources such as dictionaries or translation tools. In addition, once the topic model is estimated, it can be applied to other cross-lingual tasks provided that they are consistent.

Figure 1. Illustration of our proposed method

The remainder of this paper is organized as follows. In the next section we briefly discuss related work on the entity linking task. Section 3 describes our bilingual topic model in detail. Cross-lingual entity linking is presented in Section 4. We then present the experimental results in Section 5. Section 6 concludes the paper and presents future work.

¹ http://www.imakenews.com/lweaver/e_article001189962.cfm?x=bdhk3s9,b4wmV4K7

2 Related Work

The first type of research related to our work concerns entity linking. Many researchers have developed entity linking systems based on the context similarity between the query and the candidate entities, with the various approaches differing in how they compute that similarity. Cucerzan [2007] developed a disambiguation approach based on the agreement between the contextual information of candidate entities and the query, as well as the agreement among the category tags associated with candidate entities. Bunescu and Pasca [2006] trained a disambiguation SVM kernel on Wikipedia data; the similarity between the query and a candidate entity was computed based on the SVM kernel, and the linking system selected the entity with the highest similarity as the answer. Han and Zhao [2009] constructed a large-scale semantic network from Wikipedia and computed the similarity between the query and a candidate entity based on the constructed semantic network. Radford et al. [2010] applied a graph-based ranking method in their entity linking system; context entities were taken into account in order to reach a globally optimized solution. Han and Sun [2011] proposed a generative probabilistic model that can leverage entity popularity knowledge, name knowledge, and context knowledge for the entity linking task.

Other researchers have developed supervised learning approaches for the entity linking task. Milne and Witten [2008] proposed a machine learning approach to disambiguation that uses the links found within Wikipedia articles for training. The two main features they used in their classifier were the commonness of each candidate entity and its relatedness to the surrounding context. Dredze et al. [2010] employed a learning-to-rank method to disambiguate candidate entities; a comprehensive feature set was used in their system. Zhang et al. [2010] proposed a method to automatically generate a large-scale corpus, which was used to train a binary classifier. Zhang et al. [2011] employed a Wikipedia-LDA model to represent contexts as probability distributions over Wikipedia categories and used this model to compute the similarity between the query and the candidate entities. This semantic similarity was considered an important feature in their framework. Dai et al. [2011] employed a Markov logic network to handle this task, but they only conducted experiments in the biomedical domain. Shen et al. [2012]
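Once the mention context and the candidate entities are expressed as distributions over the shared bilingual topics, the linking step described in the introduction reduces to ranking candidates by cosine similarity. The following is a minimal sketch of that step only; the toy topic vectors and entity names are invented for illustration, and in the actual framework the mention vector would be inferred from the (e.g. Chinese) context via the Chinese-side topic-word distributions, while each candidate's vector would come from its English Wikipedia article via the English side.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def link(mention_theta, candidates):
    """Unsupervised linking: pick the candidate entity whose topic
    distribution is closest to the mention context's distribution."""
    return max(candidates, key=lambda name: cosine(mention_theta, candidates[name]))

# Toy distributions over K = 3 shared bilingual topics (hypothetical values).
mention_theta = [0.7, 0.2, 0.1]
candidates = {
    "Michael_Jordan_(basketball)": [0.8, 0.1, 0.1],
    "Michael_I._Jordan_(scientist)": [0.1, 0.8, 0.1],
}
print(link(mention_theta, candidates))  # -> Michael_Jordan_(basketball)
```

Because both vectors live in the same K-dimensional topic space regardless of surface language, no term overlap between the two documents is required, which is exactly what the BOW model lacks in the cross-lingual setting. The score could equally serve as one feature in a supervised ranker, as the paper notes.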