Cross-Lingual Link Discovery Between Chinese and English Wiki Knowledge Bases
Qingliang Miao, Huayu Lu, Shu Zhang, Yao Meng
Fujitsu R&D Center Co., Ltd.
No.56 Dong Si Huan Zhong Rd, Chaoyang District, Beijing, P.R. China
{qingliang.miao, zhangshu, mengyao}@cn.fujitsu.com [email protected]

PACLIC-27: 27th Pacific Asia Conference on Language, Information, and Computation, pages 374-381
Copyright 2013 by Qingliang Miao, Huayu Lu, Shu Zhang, and Yao Meng

Abstract

Wikipedia is an online multilingual encyclopedia that contains a very large number of articles covering most written languages. However, one critical issue for Wikipedia is that pages in different languages are rarely linked, except for the cross-lingual links between pages about the same subject. This poses serious difficulties for humans and machines that seek information from sources in different languages. To address this issue, we propose a hybrid approach that exploits anchor strength, topic relevance and an entity knowledge graph to automatically discover cross-lingual links. In addition, we develop CELD, a system for automatically linking key terms in Chinese documents with English concepts. As demonstrated in the experimental evaluation, the proposed model outperforms several baselines on the NTCIR data set, which was designed specifically for evaluating cross-lingual link discovery.

1 Introduction

Wikipedia is the largest multilingual encyclopedia online, with over 19 million articles in 218 written languages. However, the anchored links in Wikipedia articles are mainly created within the same language. Consequently, knowledge sharing and discovery can be impeded by the absence of links between different languages. Figure 1 shows the statistics of monolingual and cross-lingual alignment in Chinese and English Wikipedia. As can be seen, there are 2.6 million internal links within English Wikipedia and 0.32 million internal links within Chinese Wikipedia, but only 0.18 million links from Chinese Wikipedia pages to English ones. For example, in the Chinese Wikipedia page "武术 (Martial arts)", anchors are linked only to related Chinese articles about different kinds of martial arts, such as "拳击 (Boxing)", "柔道 (Judo)" and "击剑 (Fencing)". There are no anchors linked to the related English articles, such as "Boxing", "Judo" and "Fencing". As a result, information flow and knowledge propagation can easily be blocked between articles in different languages.

Figure 1. Statistics of English to English links (E2E), Chinese to Chinese links (C2C) and Chinese to English links (C2E).

Consequently, automatically creating cross-lingual links between Chinese and English Wikipedia would be very useful for information flow and knowledge sharing. At present, there are several monolingual link discovery tools for English Wikipedia, which assist topic curators in discovering prospective anchors and targets for a given Wikipedia page. However, no comparable cross-lingual tools yet exist that support the linking of documents across multiple languages (Tang et al., 2012). As a result, the work is mainly done manually, which is tedious, time consuming, and error prone.

One way to solve this issue is cross-lingual link discovery technology, which automatically creates potential links between documents in different languages. Cross-lingual link discovery not only accelerates knowledge sharing across languages on the Web, but also benefits many practical applications such as information retrieval and machine translation (Wang et al., 2012). In the existing literature, a few approaches have been proposed for linking English Wikipedia to other languages (Kim and Gurevych, 2011; Fahrni et al., 2011). Generally speaking, there are three steps in cross-lingual link discovery: (1) apply information extraction techniques to extract key terms from source language documents; (2) use machine translation systems to translate key terms and source documents into the target language; (3) apply entity resolution methods to identify the corresponding concepts in the target language.

However, in the key term extraction step, most works rely on statistical characteristics of anchor text (Tang et al., 2012) and ignore topic relevance. In this case, common concepts are selected as key terms even though they are not related to the topic of the Wikipedia page. For example, in the Chinese Wikipedia page "武术 (Martial arts)", country names such as "中国 (China)", "日本 (Japan)" and "韩国 (Korea)" are also selected as key terms when using anchor statistics. For term translation, existing methods usually depend on machine translation and suffer from translation errors, particularly those involving named entities such as person names (Cassidy et al., 2012). Moreover, machine translation systems are prone to introducing translation ambiguities. In the entity resolution step, some works use simple title matching to find the concept in the target language, which cannot distinguish ambiguous entities effectively (Kim and Gurevych, 2011).

In this paper, we investigate the problem of cross-lingual link discovery from Chinese Wikipedia pages to English ones. The problem is non-trivial and poses a set of challenges.

Linguistic complexity

Chinese Wikipedia is more complex because its contributors come from different Chinese-speaking geographic areas and use different language variations. For example, the Yue dialect¹ is a primary branch of Chinese spoken in southern China, and Wu² is a Sino-Tibetan language spoken in most of the southeast. Moreover, these contributors cite modern and ancient sources that combine simplified and traditional Chinese text, as well as regional variants (Tang et al., 2012). Consequently, it is necessary to normalize words into simplified Chinese before cross-lingual linking.

¹ http://zh-yue.wikipedia.org
² http://wuu.wikipedia.org

Key Term Extraction

Different kinds of key term ranking methods can be used in key term extraction, such as tf-idf, information gain, anchor probability and anchor strength (Kim and Gurevych, 2011). How can we define a model that incorporates both the global statistical characteristics and the topically related context?

Translation

Key term translation can rely on bilingual dictionaries and machine translation. Methods of this kind can obtain high precision but suffer from low recall. Using larger dictionaries or corpora for translation tends to introduce translation ambiguities. How can we increase recall without introducing additional ambiguities?

To address the above challenges, we investigate several important factors in the cross-lingual link discovery problem and propose a hybrid approach. Our contributions include:

(1) We develop a normalization lexicon for Chinese variant characters. This lexicon can be used for conversion between traditional and simplified Chinese and for normalizing other variations. We also discover entity knowledge from Wikipedia and Chinese encyclopedia sources, and then build a knowledge graph that includes mentions, concepts, translations and corresponding confidence scores.

(2) We present an integrated model for key term extraction, which leverages anchor statistical probability information and topical relevance. An efficient candidate selection method and disambiguation algorithm enable this model to meet real-time requirements.

(3) We implement a system and evaluate it using the NTCIR cross-lingual link discovery dataset. Compared with several baselines, our system achieves high precision and recall.

The remainder of the paper is organized as follows. In the following section we review the existing literature. Then, we formally introduce the problem of cross-lingual link discovery and some related concepts in section 3. We introduce the proposed approach in section 4. We conduct comparative experiments and present the experimental results in section 5. Finally, we conclude the paper with a summary of our work and give our future research directions.

2 Related Works

Generally speaking, link discovery is a kind of semantic annotation (Kiryakov et al., 2004), which is characterized as the dynamic creation of interrelationships between concepts in a knowledge base and mentions in unstructured or semi-structured documents (Bontcheva and Rout, 2012).

[...] the ID of the knowledge base entry to which the name refers, or NIL if there is no such knowledge base entry. Due to the intrinsic ambiguity of named entities, most works on the entity linking task focus on named entity disambiguation. For example, Han and Sun (2012) propose a generative entity-topic model that effectively combines context compatibility and topic coherence. Their model can accurately disambiguate most mentions in a document using both local information and global consistency.

Following this research stream, researchers have been paying more and more attention to cross-lingual semantic annotation (CLSA). The Knowledge Base Population (KBP2011) evaluations propose a cross-lingual entity linking task, which aims to find links between Chinese queries and English concepts. The NTCIR9 cross-lingual link discovery task is another kind of cross-lingual semantic annotation. These two tasks differ in their query selection criteria, leading to different technical difficulties and concerns. In KBP2011, key terms are manually selected to cover many ambiguous entities and name variants. Consequently, disambiguation is crucial in KBP2011. In NTCIR9, by contrast,
participants have to extract key terms