Cross-lingual Joint Entity and Word Embedding to Improve Entity Linking and Parallel Sentence Mining

Xiaoman Pan∗, Thamme Gowda‡, Heng Ji∗†, Jonathan May‡, Scott Miller‡
∗ Department of Computer Science, † Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
{xiaoman6,hengji}@illinois.edu
‡ Information Sciences Institute, University of Southern California
{tg,jonmay,smiller}@isi.edu

Abstract

Entities, which refer to distinct objects in the real world, can be viewed as language universals and used as effective signals to generate less ambiguous semantic representations and to align multiple languages. We propose a novel method, CLEW, to generate cross-lingual data that is a mix of entities and contextual words based on Wikipedia. We replace each anchor link in the source language with its corresponding entity title in the target language if it exists, or in the source language otherwise. A cross-lingual joint entity and word embedding learned from this kind of data not only can disambiguate linkable entities but can also effectively represent unlinkable entities. Because this multilingual common space directly relates the semantics of contextual words in the source language to that of entities in the target language, we leverage it for unsupervised cross-lingual entity linking. Experimental results show that CLEW significantly advances the state of the art: up to 3.1% absolute F-score gain for unsupervised cross-lingual entity linking. Moreover, it provides reliable alignment on both the word/entity level and the sentence level, and thus we use it to mine parallel sentences for all $\binom{302}{2}$ language pairs in Wikipedia.[1]

1 Introduction

The sheer amount of natural language data provides a great opportunity to represent named entity mentions by their probability distributions, so that they can be exploited for many Natural Language Processing (NLP) applications. However, named entity mentions are fundamentally different from common words or phrases in three aspects. First, the semantic meaning of a named entity mention (e.g., a person name "Bill Gates") is not a simple summation of the meanings of the words it contains ("Bill" + "Gates"). Second, entity mentions are often highly ambiguous in various local contexts. For example, "Michael Jordan" may refer to the basketball player or the computer science professor. Third, representing entity mentions as mere phrases fails when names are rendered quite differently, especially when they appear across multiple languages. For example, "Ang Lee" in English is "李安" (Li An) in Chinese.

Fortunately, entities, the objects which mentions refer to, are unique and equivalent across languages. Many manually constructed entity-centric knowledge base resources such as Wikipedia[2], DBPedia (Auer et al., 2007) and YAGO (Suchanek et al., 2007) are widely available. Even better, they are massively multilingual. For example, as of August 2018, Wikipedia contains 21 million inter-language links between 302 languages.[3] We propose a novel cross-lingual joint entity and word embedding (CLEW) learning framework based on multilingual Wikipedia and evaluate its effectiveness on two practical NLP applications: Cross-lingual Entity Linking and Parallel Sentence Mining.

Wikipedia contains rich entity anchor links. As shown in Figure 2, many mentions (e.g., "小米" (Xiaomi)) in a source language are linked to the entities in the same language that they refer to (e.g., zh/小米科技 (Xiaomi Technology)), and some mentions are further linked to their corresponding English entities (e.g., the Chinese mention "苹果" (Apple) is linked to the entity en/Apple_Inc. in English). We replace each mention (anchor link) in the source language with its corresponding entity title in the target language if it exists, or in the source language otherwise.

[1] We make all software and resources publicly available for research purposes at http://panx27.github.io/wikiann.
[2] https://www.wikipedia.org
[3] https://en.wikipedia.org/wiki/Help:Interlanguage_links

After this replacement, each entity mention is treated as a unique disambiguated entity, and we can learn joint entity and word embedding representations for the source language and the target language respectively.

Furthermore, we leverage these shared target-language entities as pivots to learn a rotation matrix and seamlessly align the two embedding spaces into one by linear mapping. In this unified common space, multiple mentions are reliably disambiguated and grounded, which enables us to directly compute the semantic similarity between a mention in a source language and an entity in a target language (e.g., English), and thus we can perform Cross-lingual Entity Linking in an unsupervised way, without using any training data. In addition, considering each pair of Wikipedia articles connected by an inter-language link as comparable documents, we use this multilingual common space to represent sentences and extract many parallel sentence pairs.

The novel contributions of this paper are:

• We develop a novel approach based on rich anchor links in Wikipedia to learn cross-lingual joint entity and word embeddings, so that entity mentions across multiple languages are disambiguated and grounded into one unified common space.

• Using this joint entity and word embedding space, entity mentions in any language can be linked to an English knowledge base without any annotation cost. We achieve state-of-the-art performance on unsupervised cross-lingual entity linking.

• We construct a rich resource of parallel sentences for $\binom{302}{2}$ language pairs along with accurate entity alignment and word alignment.

2 Approach

2.1 Training Data Generation

Wikipedia contains rich entity anchor links, as in the following sentence from English Wikipedia markup: "[[Apple Inc.|apple]] is a technology company.", where [[Apple Inc.|apple]] is an anchor link that links the anchor text "apple" to the entity en/Apple_Inc.[4]

[4] In this paper, we use langcode/entity_title to represent entities in Wikipedia in each individual language. For example, en/* refers to an entity in English Wikipedia (en.wikipedia.org/wiki/*).

Traditional approaches to derive training data from Wikipedia usually replace each anchor link with its anchor text, for example, "apple is a technology company.". These methods have two limitations: (1) Information loss: the anchor text "apple" itself does not convey information such as the fact that the entity is a company; (2) Ambiguity (Faruqui et al., 2016): the fruit sense and the company sense of "apple" mistakenly share one surface form. Similar to previous work (Wang et al., 2014; Tsai and Roth, 2016; Yamada et al., 2016), we replace each anchor link with its corresponding entity title, and thus treat each entity title as a unique word. For example, "en/Apple_Inc. is a technology company.". Using this kind of data, a mix of entity titles and contextual words, we can learn a joint embedding of entities and words.

[Figure 1: Traditional word embedding (left), and joint entity and word embedding (right).]

The results from traditional word embedding and from joint entity and word embedding for "apple" are visualized through Principal Component Analysis (PCA) in Figure 1. Using the joint embedding, we can successfully separate the words referring to the fruit from those referring to companies in the vector space. Moreover, similarity can be computed at the entity level instead of the word level. For example, en/Apple_Inc. and en/Steve_Jobs are close in the vector space because they share many context words and entities.

Moreover, the above approach can be easily extended to the cross-lingual setting by using Wikipedia inter-language links. We replace each anchor link in a source language with its corresponding entity title in a target language if it exists, and otherwise replace it with its corresponding entity title in the source language. An example is illustrated in Figure 2.

[Figure 2: Using Wikipedia inter-language links to generate sentences which contain words and entities in a source language (e.g., Chinese) and entities in a target language (e.g., English).
Example Chinese Wikipedia sentence: [[小米科技|小米]] 被 誉为 中国的 [[苹果公司|苹果]] 。
Anchor links: 小米 → zh/小米科技 (no inter-language link); 苹果 → zh/苹果公司 → en/Apple_Inc.
Generated sentence: zh/小米科技 被 誉为 中国的 en/Apple_Inc. 。 (Xiaomi is known as China's Apple.)]

Using this approach, the entities in a target language can be embedded along with the words and entities in a source language, as illustrated in Figure 3. This joint representation has two advantages: (1) Disambiguation: for example, the two entities en/Apple_Inc. and en/Apple can be differentiated by their distinct neighbors "电脑" (computer) and "水果" (fruit) respectively. (2) Effective representation of unknown entities: for example, the new entity zh/小米科技 (Xiaomi Technology), a Chinese mobile phone manufacturer, may not have an English Wikipedia page yet, but because it is close to neighbors such as en/Microsoft, "手机" (phone) and "公司" (company), we can infer that it is likely to be a technology company.

[Figure 3: Embedding which includes entities in English, and words and entities in Chinese (English words in brackets are human translations of Chinese words).]
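The replacement rule of this section is mechanical enough to sketch in code. Below is a minimal illustration, not the paper's released implementation; the LANGLINKS table is a hypothetical stand-in for the inter-language link mapping extracted from a Wikipedia dump.

    import re

    # Hypothetical inter-language link table: source-language entity title ->
    # English entity title. zh/小米科技 has no English page in this example,
    # so it is absent and falls back to its source-language title.
    LANGLINKS = {"苹果公司": "en/Apple_Inc."}

    # Matches [[Title]] and [[Title|anchor text]] wikitext anchor links.
    ANCHOR = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]+)?\]\]")

    def clew_sentence(markup: str, src_lang: str = "zh") -> str:
        """Replace each anchor link with the target-language entity title if an
        inter-language link exists, otherwise with the source-language title.
        The anchor text itself is discarded, as described in Section 2.1."""
        def repl(match: re.Match) -> str:
            title = match.group(1)
            return LANGLINKS.get(title, f"{src_lang}/{title}")
        return ANCHOR.sub(repl, markup)

    print(clew_sentence("[[小米科技|小米]] 被 誉为 中国的 [[苹果公司|苹果]] 。"))
    # -> zh/小米科技 被 誉为 中国的 en/Apple_Inc. 。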

2.2 Linear Mapping across Languages

Word embedding spaces have similar geometric arrangements across languages (Mikolov et al., 2013b). Given two sets of independently trained word embeddings, the source language embedding $Z^S$ and the target language embedding $Z^T$, and a set of pre-aligned word pairs, a linear mapping $\mathbf{W}$ is learned to transform $Z^S$ into a shared space where the distance between the embedding of a source language word and the embedding of its pre-aligned target language word is minimized. Concretely, given a set of pre-aligned word pairs, we use $X$ and $Y$ to denote two aligned matrices which contain the embeddings of the pre-aligned words from $Z^S$ and $Z^T$ respectively. A linear mapping $\mathbf{W}$ can be learned such that:

$$\arg\min_{\mathbf{W}} \|\mathbf{W}X - Y\|_F$$

Previous work (Xing et al., 2015; Smith et al., 2017) shows that enforcing an orthogonality constraint on $\mathbf{W}$ yields better performance. Consequently, the above equation can be transformed into the Orthogonal Procrustes problem (Conneau et al., 2017):

$$\arg\min_{\mathbf{W}} \|\mathbf{W}X - Y\|_F = UV^\top$$

where $\mathbf{W}$ can be obtained from the singular value decomposition (SVD) of $YX^\top$ such that:

$$U\Sigma V^\top = \mathrm{SVD}(YX^\top)$$

In this paper, we propose using entities instead of pre-aligned words as anchors to learn such a linear mapping $\mathbf{W}$. The basic idea is illustrated in Figure 4. We use $E_T$ and $W_T$ to denote the sets of entities and words in the target language associated with the target entity and word embedding $Z^T$:

$$Z^T = \{z^t_{e_1}, \ldots, z^t_{e_{|E_T|}}, z^t_{w_1}, \ldots, z^t_{w_{|W_T|}}\}$$

Similarly, we use $E_S$ and $W_S$ to denote the sets of entities and words in the source language associated with the source entity and word embedding $Z^S$:

$$Z^S = \{z^s_{e_1}, \ldots, z^s_{e_{|E_S|}}, z^s_{w_1}, \ldots, z^s_{w_{|W_S|}}\}$$

and use $E'_T$ to denote the set of entities in the source language data which are replaced with their corresponding entities in the target language, where $E'_T \subseteq E_T$. Then $Z^S$ can be represented as

$$Z^S = \{z^{t'}_{e_1}, \ldots, z^{t'}_{e_{|E'_T|}}, z^s_{e_1}, \ldots, z^s_{e_{|E_S|-|E'_T|}}, z^s_{w_1}, \ldots, z^s_{w_{|W_S|}}\}$$

Note that $z^t_{e_i}$ and $z^{t'}_{e_i}$ are the embeddings of $e_i$ in $Z^T$ and $Z^S$ respectively. Therefore, using the entities in $E'_T$ as anchors, we can learn a linear mapping $\mathbf{W}$ that maps $Z^S$ into the vector space of $Z^T$, and obtain the cross-lingual joint entity and word embedding $Z$.

[Figure 4: Using the aligned entities as anchors to learn a linear mapping (rotation matrix) which maps a source language embedding space to a target language embedding space.]
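For concreteness, the closed-form solution above amounts to a few lines of NumPy. This is a sketch of the standard orthogonal Procrustes recipe rather than the authors' code; the toy data merely verifies that a planted rotation is recovered from anchor pairs.

    import numpy as np

    def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        """Solve argmin_W ||W X - Y||_F over orthogonal W, where the columns
        of X (source) and Y (target) are the embeddings of the anchor
        entities in E'_T. The solution is W = U V^T, where
        U, Sigma, V^T = SVD(Y X^T)."""
        U, _, Vt = np.linalg.svd(Y @ X.T)
        return U @ Vt

    # Toy check: 300-dimensional embeddings of 5,000 anchor entities.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 5000))
    rotation = np.linalg.qr(rng.standard_normal((300, 300)))[0]  # hidden W
    Y = rotation @ X
    W = procrustes_align(X, Y)
    print(np.allclose(W @ X, Y))  # True: the hidden rotation is recovered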

We adopt the refinement procedure proposed by Conneau et al. (2017) to improve the quality of $\mathbf{W}$. A set of new high-quality anchors is generated to refine the $\mathbf{W}$ learned from $E'_T$. High-quality anchors refer to entities that have high frequency (e.g., the top 5,000) and entities that are mutual nearest neighbors. We iteratively apply this procedure to optimize $\mathbf{W}$: at each iteration, the new high-quality anchors are exploited to learn a new mapping.

Conneau et al. (2017) also propose a novel comparison metric, Cross-domain Similarity Local Scaling (CSLS), to relieve the hubness phenomenon, where some vectors (hubs) are the nearest neighbors of many others. For example, the entity en/United_States is a hub in the vector space. By employing this metric, the similarity of isolated vectors is increased, while the similarity of vectors in dense areas is decreased. Specifically, given a mapped source embedding $\mathbf{W}x$ and a target embedding $y$, the mean cosine similarities of $\mathbf{W}x$ and $y$ with their $K$ nearest neighbors in the other language, $r_T(\mathbf{W}x)$ and $r_S(y)$, are computed respectively. The comparison metric is defined as follows:

$$\mathrm{CSLS}(\mathbf{W}x, y) = 2\cos(\mathbf{W}x, y) - r_T(\mathbf{W}x) - r_S(y)$$

Conneau et al. (2017) show that the performance is essentially the same for K = 5, 10, 50. Following this work, we set K = 10.

3 Downstream Applications

We apply CLEW to enhance two important downstream tasks: Cross-lingual Entity Linking and Parallel Sentence Mining.

3.1 Unsupervised Cross-lingual Entity Linking

Cross-lingual Entity Linking aims to link an entity mention in a source language text to its referent entity in a knowledge base (KB) in a target language (e.g., English Wikipedia). A typical Cross-lingual Entity Linking framework includes three steps: mention translation, entity candidate generation, and mention disambiguation. We use translation dictionaries collected from Wikipedia (Ji et al., 2009) to translate each mention into English; if a mention has multiple translations, we merge the linking results of all translations at the end. We adopt a dictionary-based approach (Medelyan and Legg, 2008) to generate entity candidates for each mention. Then we use CLEW to implement the following two widely used mention disambiguation features: Context Similarity and Coherence.

Context Similarity refers to the similarity between the context of a mention and a candidate entity. Given a mention $m$, we consider the entire sentence containing $m$ as its local context. Using the CLEW embedding $Z$, the vectors of the context words are averaged to obtain the context vector representation of $m$:

$$v_m = \frac{1}{|W_m|} \sum_{w \in W_m} z_w$$

where $W_m$ is the set of context words of $m$, and $z_w \in Z$ is the embedding of the context word $w$. We measure the context similarity between $m$ and each of its entity candidates by the cosine similarity between $v_m$ and the entity embedding $z_e \in Z$ such that:

$$F_{txt}(e) = \cos(v_m, z_e) = \frac{v_m \cdot z_e}{\|v_m\| \, \|z_e\|}$$
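A minimal sketch of this Context Similarity feature, assuming only a dict-like lookup from words and entity titles to their CLEW vectors (the names here are illustrative, not the authors' API):

    import numpy as np

    def context_similarity(context_words, candidate, emb):
        """F_txt(e): cosine between the averaged context-word vectors of a
        mention and a candidate entity vector, both in the CLEW space Z.
        context_words: words of the sentence containing the mention;
        candidate: an entity title such as "en/Apple_Inc.";
        emb: dict from words / entity titles to numpy vectors."""
        known = [emb[w] for w in context_words if w in emb]
        v_m = np.mean(known, axis=0)        # assumes >= 1 known context word
        z_e = emb[candidate]
        return float(v_m @ z_e / (np.linalg.norm(v_m) * np.linalg.norm(z_e)))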

Feature          Description
$F_{prior}(e)$   Entity Prior: $\frac{|A_{e,*}|}{|A_{*,*}|}$, where $A_{e,*}$ is the set of anchor links that link to entity $e$ and $A_{*,*}$ is the set of all anchor links in the KB.
$F_{prob}(e|m)$  Mention to Entity Probability: $\frac{|A_{e,m}|}{|A_{*,m}|}$, where $A_{*,m}$ is the set of anchor links with anchor text $m$ and $A_{e,m}$ is the subset that links to entity $e$.
$F_{type}(e|m,t)$  Entity Type (Ling et al., 2015): $\frac{p(e|m)}{\sum_{e' \mapsto t} p(e'|m)}$, where $e \mapsto t$ indicates that $t$ is one of $e$'s entity types. The conditional probability $p(e|m)$ can be estimated by $F_{prob}(e|m)$.

Table 1: Mention disambiguation features.
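The first two features in Table 1 are plain count ratios over Wikipedia anchor links; a sketch with a toy, invented anchor list:

    from collections import Counter

    # Hypothetical (anchor text, entity) pairs, one per anchor link in the KB.
    anchors = [("apple", "en/Apple_Inc."), ("apple", "en/Apple_Inc."),
               ("apple", "en/Apple"), ("big apple", "en/New_York_City")]

    link_counts = Counter(anchors)                   # |A_{e,m}|
    entity_counts = Counter(e for _, e in anchors)   # |A_{e,*}|
    mention_counts = Counter(m for m, _ in anchors)  # |A_{*,m}|
    total = len(anchors)                             # |A_{*,*}|

    def f_prior(e):
        """Entity prior: the fraction of all anchor links pointing to e."""
        return entity_counts[e] / total

    def f_prob(e, m):
        """Mention-to-entity probability p(e|m)."""
        return link_counts[(m, e)] / mention_counts[m]

    print(f_prior("en/Apple_Inc."))          # 0.5
    print(f_prob("en/Apple_Inc.", "apple"))  # 0.666...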

Coherence is driven by the assumption that if multiple mentions appear together within a context window, their referent entities are more likely to be strongly connected to each other in the KB. Previous work (Cucerzan, 2007; Milne and Witten, 2008; Hoffart et al., 2011; Ratinov et al., 2011; Cheng and Roth, 2013; Ceccarelli et al., 2013; Ling et al., 2015) considers the KB as a knowledge graph and models coherence based on the overlapping neighbors of two entities in the knowledge graph. These approaches heavily rely on explicit connections among entities in the knowledge graph and thus cannot capture the coherence between two entities that are only implicitly connected. For example, the two entities en/Mosquito and en/Cockroach have very few overlapping neighbors in the knowledge graph, but they usually appear together and have similar contexts in text. Using the CLEW embedding $Z$, the coherence score can instead be estimated by the cosine similarity between the embeddings of two entities; this coherence metric pays more attention to semantics. We consider mentions that appear in the same sentence as coherent. Let $m$ be a mention, and $C_e$ be the set of corresponding entity candidates of $m$'s coherent mentions. The coherence score for each of $m$'s entity candidates is the average:

$$F_{coh}(e) = \frac{1}{|C_e|} \sum_{c_e \in C_e} \cos(z_e, z_{c_e})$$

Finally, we linearly combine these two features with several other common mention disambiguation features, as shown in Table 1.

3.2 Parallel Sentence Mining

One major bottleneck of low-resource language machine translation is the lack of parallel sentences. This inspires us to mine parallel sentences from Wikipedia automatically using the CLEW embedding $Z$.

Wikipedia contributors tend to translate some content from existing articles in other languages while editing an article. Therefore, if there exists an inter-language link between two Wikipedia articles in different languages, these two articles can be considered comparable, and thus they are very likely to contain parallel sentences. We represent a Wikipedia sentence in any of the 302 languages by aggregating the embeddings of the entities and words it contains. In order to penalize highly frequent words and entities, we apply a weighted metric:

$$\mathrm{IDF}(t, S) = \log \frac{|S|}{|\{s \in S : t \in s\}|}$$

where $t$ is a term (entity or word), $S$ is an article containing $|S|$ sentences, and $|\{s \in S : t \in s\}|$ is the total number of sentences containing $t$. The embedding of a sentence $v_s$ can then be computed as:

$$v_s = \frac{1}{|T_s|} \sum_{t \in T_s} \mathrm{IDF}(t, S) \cdot z_t$$

where $T_s$ is the set of terms of $s$ and $z_t \in Z$ is the embedding of $t$.

Given two comparable Wikipedia articles connected by an inter-language link, we compute the similarity of all possible sentence pairs using the CSLS metric described in Section 2.2 and rank them. If the CSLS score of a sentence pair is greater than a threshold (in this paper, we empirically set the threshold to 0.1 based on a separate small development set), then the sentence pair is considered parallel. An advantage of our approach is that it provides a similarity score for every term pair, which can be used for improving word alignment and entity alignment.
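The IDF-weighted sentence representation above is straightforward to sketch; as before, emb stands in for a generic lookup into the CLEW space Z rather than any specific API:

    import numpy as np
    from math import log

    def sentence_embedding(sentence_terms, article_sentences, emb):
        """v_s: IDF-weighted average of term (word and entity) embeddings.
        sentence_terms: the terms of sentence s;
        article_sentences: one term list per sentence of the article S
        (s itself is among them, so document frequency is never zero);
        emb: dict from terms to their CLEW vectors z_t."""
        n = len(article_sentences)
        def idf(t):
            df = sum(1 for sent in article_sentences if t in sent)
            return log(n / df)
        weighted = [idf(t) * emb[t] for t in sentence_terms if t in emb]
        return np.mean(weighted, axis=0)

Sentence pairs from two articles connected by an inter-language link are then scored with CSLS over these vectors, as sketched in Section 4.4 below.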

4 Experiments

4.1 Training Data

We use an April 1, 2018 Wikipedia XML dump to generate the data to train the joint entity and word embeddings. We select and analyze only the main Wikipedia pages (ns tag 0) which are not redirects (redirect tag None), using the approach described in Section 2.1. We use the Skip-gram model (Mikolov et al., 2013a,c) to learn the unaligned embeddings. The number of dimensions of the embedding is set to 300, and the minimal number of occurrences, the size of the context window, and the learning rate are set to 5, 5, and 0.025 respectively.

4.2 Linear Mapping

A large number of aligned entities can be obtained using the approach described in Section 2.1. For example, there are about 400,000 aligned entities between English and Spanish. However, the mapping algorithm does not perform well if we try to align all anchors, because the embeddings of rare entities are updated less often, and thus their contexts are very different across languages. Therefore, we learn the global mapping using only high-quality anchors, selecting high-frequency entities as anchors using the salience metric described in Table 1. We use 5,000 anchors for training and 1,500 anchors for testing for each language pair. Our proposed method is applied to 9 language pairs in our experiments. Table 2 shows the statistics and the performance. We can see that mapping a language to a related language (e.g., Ukrainian to Russian) usually achieves better performance.

Source-Target   P@1    P@5    P@10
es-en           79.1   89.2   92.3
it-en           74.5   86.9   90.5
ru-en           68.4   82.8   86.7
tr-en           59.0   79.9   86.3
uk-en           63.0   79.7   85.9
zh-en           63.1   83.8   89.2
uk-ru           78.1   90.3   92.8
ru-uk           75.8   90.2   93.7

Table 2: Linear entity mapping statistics and performance (Precision (%) at K) (en: English, es: Spanish, it: Italian, ru: Russian, so: Somali, tr: Turkish, uk: Ukrainian, zh: Chinese).

4.3 Cross-lingual Entity Linking

We use the training set and evaluation set (LDC2015E75 and LDC2015E103) of the TAC Knowledge Base Population (TAC-KBP) 2015 Tri-lingual Entity Linking Track (Ji et al., 2015) for the cross-lingual entity linking experiments, because these data sets include the most recent and comprehensive gold-standard annotations for this task and allow us to compare our model with previously reported state-of-the-art approaches on the same benchmark.

We first compare our unsupervised approach to the top TAC2015 unsupervised system reported by Ji et al. (2015). In order to have a fair comparison with the state-of-the-art supervised methods, we also combine the features described in Section 3.1 in a point-wise learning-to-rank algorithm based on Gradient Boosted Regression Trees (Friedman, 2000). The learning rate and the maximum depth of the decision trees are set to 0.01 and 4 respectively. The results are shown in Table 3. We can see that our unsupervised and supervised approaches significantly outperform the best TAC15 systems.

Method                     ENG    CMN    SPA
Best TAC15 Unsupervised    67.1   78.1   71.5
Our Unsupervised           70.0   81.2   73.4
  w/o Context Similarity   66.9   79.0   70.6
  w/o Coherence            68.5   78.6   71.4
Best TAC15 Supervised      73.7   83.1   80.4
(Tsai and Roth, 2016)      -      83.6   80.9
(Sil et al., 2017)         -      84.4   82.3
Our Supervised             74.8   84.2   82.1
  w/o Context Similarity   72.2   80.4   79.5
  w/o Coherence            73.3   82.1   77.8

Table 3: F1 (%) on the evaluation set of the TAC-KBP 2015 Tri-lingual Entity Linking Track (Ji et al., 2015) (ENG: English, CMN: Chinese, SPA: Spanish).

We further observe that the Context Similarity and Coherence features derived from $Z$ play significant roles; without them, the performance drops significantly, as shown in Table 3. For example, in the sentence "欧盟委员会副主席雷丁就此表示 ... (European Commission vice president Reding said that ...)", without the Context Similarity feature, the mention "雷丁" (Reding) is likely to be linked to the football club en/Reading_F.C. or the city en/Redding,_California. Using contextual words such as "委员会" (commission) and "主席" (president), we can successfully link this mention to the target entity en/Viviane_Reding.
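The supervised variant above is a point-wise ranker over the features of Section 3.1 and Table 1. The paper does not name its GBRT implementation; the sketch below uses scikit-learn's GradientBoostingRegressor with the reported hyper-parameters, and the feature values are invented for illustration.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # One row per (mention, candidate) pair: e.g. [F_txt, F_coh, F_prior,
    # F_prob, F_type]; the label is 1.0 for the gold entity, else 0.0.
    X_train = np.array([[0.71, 0.42, 0.50, 0.66, 0.90],
                        [0.12, 0.05, 0.01, 0.08, 0.10],
                        [0.33, 0.20, 0.10, 0.25, 0.40]])
    y_train = np.array([1.0, 0.0, 0.0])

    # Hyper-parameters as reported in Section 4.3.
    ranker = GradientBoostingRegressor(learning_rate=0.01, max_depth=4)
    ranker.fit(X_train, y_train)

    def link(candidate_features: np.ndarray) -> int:
        """Return the index of the highest-scoring entity candidate."""
        return int(np.argmax(ranker.predict(candidate_features)))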

4.4 Parallel Sentence Mining

The proposed parallel sentence mining approach can be applied to any two languages in Wikipedia. Therefore, we have mined parallel sentences from a total of $\binom{302}{2}$ language pairs and made this data set publicly available for research purposes. Table 4 shows some examples of mined parallel sentences from Wikipedia, with word and entity alignments highlighted.

Amharic - English
* ዓርብ የሳምንቱ ስድስተኛ ቀን ሲሆን ሐሙስ በኋላ ቅዳሜ በፊት ይገኛል ።
* Friday is the day after Thursday and the day before Saturday .

Yoruba - English
* Glasgow ni ilu totobijulo ni orile-ede Skotlandi ati eyi totobijulo keta ni Britani .
* Glasgow is the largest city in Scotland , and third largest in the United Kingdom .

Uyghur - English
* ﺟﯜﻣ ، ﭘﯾﺸﻧﺒ ﺑﻠن ﺷﻧﺒ ﺋﻮﺗﺘﯘرﺳﺪﻜﻰ ، ھﭙﺘﻨﯔ ﺑﺷﻨﭽﻰ ﻛﯜﻧﺪۇر .
* Friday is the day after Thursday and the day before Saturday .

Vietnamese - English
* Bardolph là một làng thuộc quận McDonough , tiểu bang Illinois , Hoa Kỳ .
* Bardolph is a village in McDonough County , Illinois , United States .

Russian - Ukrainian
* Статья 2 - я Конституции СССР 1977 года провозглашала : « Вся власть в СССР принадлежит народу .
* Стаття 2 - га Конституції СРСР 1977 року проголошувала : " Вся влада в СРСР належить народові .
(Article 2 of the Constitution of the USSR in 1977 proclaimed: "All power in the USSR belongs to the people.")

Classical Chinese - Modern Chinese
* 至二战之时，南斯拉夫屡败，终为德意志、义大利所分。
* 在二次世界大战期间，南斯拉夫多次战败，分别被德国、意大利占领。
(During World War II, Yugoslavia was defeated several times and was occupied by Germany and Italy.)

Table 4: Examples of mined parallel sentences from Wikipedia. A portion of the alignments are highlighted using the same colors.

We randomly select 100 mined parallel sentence pairs for each of 3 language pairs, and ask linguistic experts to judge the quality of these sentence pairs (perfect, partial, or not parallel). The results are shown in Table 5. We can see that the quality of the mined parallel sentences is promising and the quality of the word and entity alignments is decent.

Language Pairs       Perfect  Partial  Word    Entity
Chinese-English      81%      10%      92.3%   95.5%
Spanish-English      75%      13%      89.7%   91.1%
Russian-Ukrainian    70%      16%      82.4%   90.3%

Table 5: Quality of the mined parallel sentences (Perfect and Partial stand for the percentage of perfect and partial sentence pairs respectively; Word and Entity stand for the accuracy of word and entity alignments respectively).

Furthermore, we evaluate the quality of the mined parallel sentences extrinsically using a neural machine translation (NMT) model. We use the Transformer model (Vaswani et al., 2017) implemented in Tensor2Tensor.[5] Our Transformer model has 6 encoder and decoder layers, 8 attention heads, 512-dimension hidden states, 2048-dimension feed-forward layers, dropout of 0.1 and label smoothing of 0.1. The model is trained for up to 128,000 optimizer steps.

Using the NMT model as a black box, we perform two experiments with the following training and tuning settings:

• Baseline: 44,000 training and 1,000 tuning sentences randomly sampled from the WMT17 News Commentary v12 Russian-English Corpus (Bojar et al., 2016).

• Our approach: adding 44,000 training and 1,000 tuning sentences mined from Wikipedia using CLEW.

Using 1,000 randomly selected sentences from the WMT17 corpus for testing, the baseline achieves a 19.0% BLEU score while our approach achieves a 20.8% BLEU score.

[5] https://github.com/tensorflow/tensor2tensor
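Putting Sections 2.2 and 3.2 together, the mining step reduces to scoring all sentence pairs of two comparable articles with CSLS and thresholding at 0.1. Below is a vectorized sketch (not the released code); sentence vectors are assumed to be built as in Section 3.2, with source vectors already mapped by W.

    import numpy as np

    def mine_parallel(src_vecs, tgt_vecs, threshold=0.1, k=10):
        """Score every sentence pair with CSLS and keep pairs above the
        threshold. src_vecs: mapped source-language sentence vectors;
        tgt_vecs: target-language sentence vectors."""
        X = np.array(src_vecs, dtype=float)
        Y = np.array(tgt_vecs, dtype=float)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        Y /= np.linalg.norm(Y, axis=1, keepdims=True)
        sim = X @ Y.T                        # cosine similarity matrix
        kx, ky = min(k, sim.shape[1]), min(k, sim.shape[0])
        r_t = np.sort(sim, axis=1)[:, -kx:].mean(axis=1)  # r_T per source
        r_s = np.sort(sim, axis=0)[-ky:, :].mean(axis=0)  # r_S per target
        csls = 2 * sim - r_t[:, None] - r_s[None, :]
        i, j = np.where(csls > threshold)
        pairs = zip(i.tolist(), j.tolist(), csls[i, j].tolist())
        return sorted(pairs, key=lambda p: -p[2])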

5 Related Work

Cross-lingual Word Embedding Learning. Mikolov et al. (2013b) first notice that word embedding spaces have similar geometric arrangements across languages, and use this property to learn a linear mapping between two spaces. After that, several methods attempt to improve the mapping (Faruqui and Dyer, 2014; Xing et al., 2015; Lazaridou et al., 2015; Ammar et al., 2016; Artetxe et al., 2017; Smith et al., 2017). The measures used to compute the similarity between a foreign word and an English word often include distributed monolingual representations at the character level (Costa-jussà and Fonollosa, 2016; Luong and Manning, 2016), the subword level (Anwarus Salam et al., 2012; Rei et al., 2016; Sennrich et al., 2016; Yang et al., 2017), and bilingual word embeddings (Madhyastha and España-Bonet, 2017). Recent attempts have shown that it is possible to derive cross-lingual word embeddings from unaligned corpora in an unsupervised fashion (Zhang et al., 2017; Conneau et al., 2017; Artetxe et al., 2018). Another strategy for cross-lingual word embedding learning is to combine monolingual and cross-lingual training objectives (Zou et al., 2013; Klementiev et al., 2012; Luong et al., 2015; Ammar et al., 2016; Vulić et al., 2017). Compared to our direct mapping approach, these methods generally require a large amount of parallel data.

Our work is largely inspired by Conneau et al. (2017). However, our work focuses on better representing entities, which are fundamentally different from common words or phrases in many aspects, as described in Section 1. Previous multilingual word embedding efforts, including Conneau et al. (2017), do not explicitly handle entity representations. Moreover, we perform comprehensive extrinsic evaluations based on downstream NLP applications including cross-lingual entity linking and machine translation, while previous work on cross-lingual embedding only focused on intrinsic evaluations.

Cross-lingual Joint Entity and Word Embedding Learning. Previous cross-lingual joint entity and word embedding methods largely neglect unlinkable entities (Tsai and Roth, 2016) or heavily rely on parallel or comparable sentences (Cao et al., 2018). Tsai and Roth (2016) apply a similar approach to generate code-switched data from Wikipedia, but their framework does not keep entities in the source language. Using all aligned entities as a dictionary, they adopt canonical correlation analysis to project two embedding spaces into one. In contrast, we choose only salient entities as anchors to learn a linear mapping. Cao et al. (2018) generate comparable data via distant supervision over multilingual knowledge bases, and use an entity regularizer and a sentence regularizer to align cross-lingual words and entities; further, they design knowledge attention and cross-lingual attention to refine the alignment. Essentially, they train cross-lingual embeddings jointly, while we align two embedding spaces that are trained independently. Moreover, compared to their approach, which relies on comparable data, aligned entities are easier to acquire.

Parallel Sentence Mining. Automatically mining parallel sentences from comparable documents is an important and useful task for improving Statistical Machine Translation. Early efforts mainly exploited bilingual word dictionaries for bootstrapping (Fung and Cheung, 2004). Recent approaches are mainly based on bilingual word embeddings (Marie and Fujita, 2017) and sentence embeddings (Schwenk, 2018) to detect sentence pairs or continuous parallel segments (Hangya and Fraser, 2019). To the best of our knowledge, this is the first work to incorporate joint entity and word embedding into parallel sentence mining. As a result, the sentence pairs we mine include reliable alignments between entity mentions, which are often out-of-vocabulary and ambiguous and thus receive poor alignment quality from previous methods.

6 Conclusions and Future Work

We developed a simple yet effective framework to learn cross-lingual joint entity and word embeddings based on rich anchor links in Wikipedia. The learned embeddings strongly enhance two downstream applications: cross-lingual entity linking and parallel sentence mining. The results demonstrate that our proposed method advances the state of the art for the unsupervised cross-lingual entity linking task. We have also constructed a valuable repository of parallel sentences for all language pairs in Wikipedia to share with the community. In the future, we will extend the framework to better represent other types of knowledge elements such as relations and events.

Acknowledgments

This research is based upon work supported in part by U.S. DARPA LORELEI Program HR0011-15-C-0115, the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract FA8650-17-C-9116, and ARL NS-CTA No. W911NF-09-2-0053. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. CoRR, abs/1602.01925.

Khan Md. Anwarus Salam, Setsuo Yamada, and Tetsuro Nishino. 2012. Sublexical translations for low-resource language. In Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages, pages 39–52, Mumbai, India. The COLING 2012 Organizing Committee.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798. Association for Computational Linguistics.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, Berlin, Heidelberg. Springer Berlin Heidelberg.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198. Association for Computational Linguistics.

Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Chengjiang Li, Xu Chen, and Tiansi Dong. 2018. Joint representation learning of cross-lingual words and entities via attentive distant supervision. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 227–237. Association for Computational Linguistics.

Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. 2013. Learning relatedness measures for entity linking. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 139–148, New York, NY, USA. ACM.

Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Seattle, Washington, USA. Association for Computational Linguistics.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Marta R. Costa-jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 357–361, Berlin, Germany. Association for Computational Linguistics.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, Prague, Czech Republic. Association for Computational Linguistics.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471. Association for Computational Linguistics.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 30–35, Berlin, Germany. Association for Computational Linguistics.

Jerome H. Friedman. 2000. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232.

Pascale Fung and Percy Cheung. 2004. Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 57–63, Barcelona, Spain. Association for Computational Linguistics.

Viktor Hangya and Alexander Fraser. 2019. Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1224–1234, Florence, Italy. Association for Computational Linguistics.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.

Heng Ji, Ralph Grishman, Dayne Freitag, Matthias Blume, John Wang, Shahram Khadivi, Richard Zens, and Hermann Ney. 2009. Name extraction and translation for distillation. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation.

Heng Ji, Joel Nothman, Ben Hachey, and Radu Florian. 2015. Overview of TAC-KBP2015 tri-lingual entity discovery and linking. In Proceedings of the Text Analysis Conference (TAC2015).

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474. The COLING 2012 Organizing Committee.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 270–280. Association for Computational Linguistics.

Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3:315–328.

Minh-Thang Luong and Christopher Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of ACL 2016.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159. Association for Computational Linguistics.

Pranava Swaroop Madhyastha and Cristina España-Bonet. 2017. Learning bilingual projections of embeddings for vocabulary expansion in machine translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 139–145, Vancouver, Canada. Association for Computational Linguistics.

Benjamin Marie and Atsushi Fujita. 2017. Efficient extraction of pseudo-parallel sentences from raw monolingual data using word embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 392–398, Vancouver, Canada. Association for Computational Linguistics.

O. Medelyan and C. Legg. 2008. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. CoRR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26.

D. Milne and I. H. Witten. 2008. Learning to link with Wikipedia. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM 2008).

Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1375–1384, Portland, Oregon, USA. Association for Computational Linguistics.

Marek Rei, Gamal Crichton, and Sampo Pyysalo. 2016. Attending to characters in neural sequence labeling models. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 309–318, Osaka, Japan. The COLING 2016 Organizing Committee.

Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL 2016.

Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2017. Neural cross-lingual entity linking. CoRR, abs/1712.01813.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. CoRR, abs/1702.03859.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598, San Diego, California. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ivan Vulić, Nikola Mrkšić, and Anna Korhonen. 2017. Cross-lingual induction and transfer of verb classes based on word vector space specialisation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2546–2558. Association for Computational Linguistics.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1591–1601. Association for Computational Linguistics.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011. Association for Computational Linguistics.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259, Berlin, Germany. Association for Computational Linguistics.

Baosong Yang, Derek F. Wong, Tong Xiao, Lidia S. Chao, and Jingbo Zhu. 2017. Towards bidirectional hierarchical representations for attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1432–1441, Copenhagen, Denmark. Association for Computational Linguistics.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1959–1970. Association for Computational Linguistics.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398. Association for Computational Linguistics.
