Cross-lingual Joint Entity and Word Embedding to Improve Entity Linking and Parallel Sentence Mining

Xiaoman Pan∗, Thamme Gowda‡, Heng Ji∗†, Jonathan May‡, Scott Miller‡
∗ Department of Computer Science, † Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
{xiaoman6,hengji}@illinois.edu
‡ Information Sciences Institute, University of Southern California
{tg,jonmay,smiller}@isi.edu

Abstract

Entities, which refer to distinct objects in the real world, can be viewed as language universals and used as effective signals to generate less ambiguous semantic representations and to align multiple languages. We propose a novel method, CLEW, to generate cross-lingual data that is a mix of entities and contextual words based on Wikipedia. We replace each anchor link in the source language with its corresponding entity title in the target language if it exists, or in the source language otherwise. A cross-lingual joint entity and word embedding learned from this kind of data not only can disambiguate linkable entities but can also effectively represent unlinkable entities. Because this multilingual common space directly relates the semantics of contextual words in the source language to that of entities in the target language, we leverage it for unsupervised cross-lingual entity linking. Experimental results show that CLEW significantly advances the state of the art: up to 3.1% absolute F-score gain for unsupervised cross-lingual entity linking. Moreover, it provides reliable alignment on both the word/entity level and the sentence level, and thus we use it to mine parallel sentences for all $\binom{302}{2}$ language pairs in Wikipedia.[1]

1 Introduction

The sheer amount of natural language data provides a great opportunity to represent named entity mentions by their probability distributions, so that they can be exploited for many Natural Language Processing (NLP) applications. However, named entity mentions are fundamentally different from common words or phrases in three aspects. First, the semantic meaning of a named entity mention (e.g., a person name "Bill Gates") is not a simple summation of the meanings of the words it contains ("Bill" + "Gates"). Second, entity mentions are often highly ambiguous in various local contexts. For example, "Michael Jordan" may refer to the basketball player or the computer science professor. Third, representing entity mentions as mere phrases fails when names are rendered quite differently, especially when they appear across multiple languages. For example, "Ang Lee" in English is "李安" (Li An) in Chinese.

Fortunately, entities, the objects which mentions refer to, are unique and equivalent across languages. Many manually constructed entity-centric knowledge base resources such as Wikipedia[2], DBPedia (Auer et al., 2007) and YAGO (Suchanek et al., 2007) are widely available. Even better, they are massively multilingual. For example, as of August 2018, Wikipedia contains 21 million inter-language links between 302 languages.[3] We propose a novel cross-lingual joint entity and word embedding (CLEW) learning framework based on multilingual Wikipedia and evaluate its effectiveness on two practical NLP applications: Cross-lingual Entity Linking and Parallel Sentence Mining.

Wikipedia contains rich entity anchor links. As shown in Figure 2, many mentions (e.g., "小米" (Xiaomi)) in a source language are linked to the entities in the same language that they refer to (e.g., zh/小米科技 (Xiaomi Technology)), and some mentions are further linked to their corresponding English entities (e.g., the Chinese mention "苹果" (Apple) is linked to the entity en/Apple_Inc. in English). We replace each mention (anchor link) in the source language with its corresponding entity title in the target language if it exists, or in the source language otherwise.

[1] We make all software and resources publicly available for research purposes at http://panx27.github.io/wikiann.
[2] https://www.wikipedia.org
[3] https://en.wikipedia.org/wiki/Help:Interlanguage_links

After this replacement, each entity mention is treated as a unique disambiguated entity, and we can learn joint entity and word embedding representations for the source language and the target language respectively.

Furthermore, we leverage these shared target-language entities as pivots to learn a rotation matrix and seamlessly align the two embedding spaces into one by linear mapping. In this unified common space, multiple mentions are reliably disambiguated and grounded, which enables us to directly compute the semantic similarity between a mention in a source language and an entity in a target language (e.g., English), and thus we can perform Cross-lingual Entity Linking in an unsupervised way, without using any training data. In addition, considering each pair of Wikipedia articles connected by an inter-language link as comparable documents, we use this multilingual common space to represent sentences and extract many parallel sentence pairs.

The novel contributions of this paper are:

• We develop a novel approach based on rich anchor links in Wikipedia to learn cross-lingual joint entity and word embeddings, so that entity mentions across multiple languages are disambiguated and grounded into one unified common space.

• Using this joint entity and word embedding space, entity mentions in any language can be linked to an English knowledge base without any annotation cost. We achieve state-of-the-art performance on unsupervised cross-lingual entity linking.

• We construct a rich resource of parallel sentences for $\binom{302}{2}$ language pairs along with accurate entity alignment and word alignment.

2 Approach

2.1 Training Data Generation

Wikipedia contains rich entity anchor links, as in the following sentence from English Wikipedia markup: "[[Apple Inc.|apple]] is a technology company.", where [[Apple Inc.|apple]] is an anchor link that links the anchor text "apple" to the entity en/Apple_Inc.[4]

[4] In this paper, we use langcode/entity_title to represent entities in Wikipedia in each individual language. For example, en/* refers to an entity in English Wikipedia (en.wikipedia.org/wiki/*).

Traditional approaches to derive training data from Wikipedia usually replace each anchor link with its anchor text, for example, "apple is a technology company.". These methods have two limitations: (1) Information loss: the anchor text "apple" itself does not convey information such as the fact that the entity is a company; (2) Ambiguity (Faruqui et al., 2016): the fruit sense and the company sense of "apple" mistakenly share one surface form. Similar to previous work (Wang et al., 2014; Tsai and Roth, 2016; Yamada et al., 2016), we replace each anchor link with its corresponding entity title, and thus treat each entity title as a unique word. For example, "en/Apple_Inc. is a technology company.". Using this kind of data, a mix of entity titles and contextual words, we can learn a joint embedding of entities and words.

[Figure 1: Traditional word embedding (left), and joint entity and word embedding (right).]

The results from traditional word embedding and from joint entity and word embedding for "apple" are visualized through Principal Component Analysis (PCA) in Figure 1. Using the joint embedding, we can successfully separate the words referring to the fruit from those referring to companies in the vector space. Moreover, similarity can be computed at the entity level instead of the word level. For example, en/Apple_Inc. and en/Steve_Jobs are close in the vector space because they share many context words and entities.

Moreover, the above approach can be easily extended to the cross-lingual setting by using Wikipedia inter-language links. We replace each anchor link in a source language with its corresponding entity title in a target language if it exists, and otherwise replace it with its corresponding entity title in the source language. An example is illustrated in Figure 2.

[Figure 2: Using Wikipedia inter-language links to generate sentences which contain words and entities in a source language (e.g., Chinese) and entities in a target language (e.g., English).
Example Chinese Wikipedia sentence: [[小米科技|小米]] 被 誉为 中国的 [[苹果公司|苹果]] 。
Anchor links: 小米 → zh/小米科技 (no inter-language link); 苹果 → zh/苹果公司 → en/Apple_Inc.
Generated sentence: zh/小米科技 被 誉为 中国的 en/Apple_Inc. 。 (Xiaomi is known as China's Apple.)]

Using this approach, the entities in a target language can be embedded along with the words and entities in a source language, as illustrated in Figure 3. This joint representation has two advantages: (1) Disambiguation: for example, the two entities en/Apple_Inc. and en/Apple can be differentiated by their distinct neighbors "电脑" (computer) and "水果" (fruit) respectively. (2) Effective representation of unknown entities: for example, the new entity zh/小米科技 (Xiaomi Technology), a Chinese mobile phone manufacturer, may not have an English Wikipedia page yet, but because it is close to neighbors such as en/Microsoft, "手机" (phone) and "公司" (company), we can infer that it is likely to be a technology company.

[Figure 3: Embedding which includes entities in English, and words and entities in Chinese (English words in brackets are human translations of Chinese words).]
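The replacement rule of this section is mechanical enough to sketch in code. Below is a minimal illustration, not the paper's released implementation; the LANGLINKS table is a hypothetical stand-in for the inter-language link mapping extracted from a Wikipedia dump.

    import re

    # Hypothetical inter-language link table: source-language entity title ->
    # English entity title. zh/小米科技 has no English page in this example,
    # so it is absent and falls back to its source-language title.
    LANGLINKS = {"苹果公司": "en/Apple_Inc."}

    # Matches [[Title]] and [[Title|anchor text]] wikitext anchor links.
    ANCHOR = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]+)?\]\]")

    def clew_sentence(markup: str, src_lang: str = "zh") -> str:
        """Replace each anchor link with the target-language entity title if an
        inter-language link exists, otherwise with the source-language title.
        The anchor text itself is discarded, as described in Section 2.1."""
        def repl(match: re.Match) -> str:
            title = match.group(1)
            return LANGLINKS.get(title, f"{src_lang}/{title}")
        return ANCHOR.sub(repl, markup)

    print(clew_sentence("[[小米科技|小米]] 被 誉为 中国的 [[苹果公司|苹果]] 。"))
    # -> zh/小米科技 被 誉为 中国的 en/Apple_Inc. 。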

2.2 Linear Mapping across Languages

Word embedding spaces have similar geometric arrangements across languages (Mikolov et al., 2013b). Given two sets of independently trained word embeddings, the source language embedding $Z^S$ and the target language embedding $Z^T$, and a set of pre-aligned word pairs, a linear mapping $\mathbf{W}$ is learned to transform $Z^S$ into a shared space where the distance between the embedding of a source language word and the embedding of its pre-aligned target language word is minimized. Concretely, given a set of pre-aligned word pairs, we use $X$ and $Y$ to denote two aligned matrices which contain the embeddings of the pre-aligned words from $Z^S$ and $Z^T$ respectively. A linear mapping $\mathbf{W}$ can be learned such that:

$$\arg\min_{\mathbf{W}} \|\mathbf{W}X - Y\|_F$$

Previous work (Xing et al., 2015; Smith et al., 2017) shows that enforcing an orthogonality constraint on $\mathbf{W}$ yields better performance. Consequently, the above equation can be transformed into the Orthogonal Procrustes problem (Conneau et al., 2017):

$$\arg\min_{\mathbf{W}} \|\mathbf{W}X - Y\|_F = UV^\top$$

where $\mathbf{W}$ can be obtained from the singular value decomposition (SVD) of $YX^\top$ such that:

$$U\Sigma V^\top = \mathrm{SVD}(YX^\top)$$

In this paper, we propose using entities instead of pre-aligned words as anchors to learn such a linear mapping $\mathbf{W}$. The basic idea is illustrated in Figure 4. We use $E_T$ and $W_T$ to denote the sets of entities and words in the target language associated with the target entity and word embedding $Z^T$:

$$Z^T = \{z^t_{e_1}, \ldots, z^t_{e_{|E_T|}}, z^t_{w_1}, \ldots, z^t_{w_{|W_T|}}\}$$

Similarly, we use $E_S$ and $W_S$ to denote the sets of entities and words in the source language associated with the source entity and word embedding $Z^S$:

$$Z^S = \{z^s_{e_1}, \ldots, z^s_{e_{|E_S|}}, z^s_{w_1}, \ldots, z^s_{w_{|W_S|}}\}$$

and use $E'_T$ to denote the set of entities in the source language data which are replaced with their corresponding entities in the target language, where $E'_T \subseteq E_T$. Then $Z^S$ can be represented as

$$Z^S = \{z^{t'}_{e_1}, \ldots, z^{t'}_{e_{|E'_T|}}, z^s_{e_1}, \ldots, z^s_{e_{|E_S|-|E'_T|}}, z^s_{w_1}, \ldots, z^s_{w_{|W_S|}}\}$$

Note that $z^t_{e_i}$ and $z^{t'}_{e_i}$ are the embeddings of $e_i$ in $Z^T$ and $Z^S$ respectively. Therefore, using the entities in $E'_T$ as anchors, we can learn a linear mapping $\mathbf{W}$ that maps $Z^S$ into the vector space of $Z^T$, and obtain the cross-lingual joint entity and word embedding $Z$.

[Figure 4: Using the aligned entities as anchors to learn a linear mapping (rotation matrix) which maps a source language embedding space to a target language embedding space.]
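For concreteness, the closed-form solution above amounts to a few lines of NumPy. This is a sketch of the standard orthogonal Procrustes recipe rather than the authors' code; the toy data merely verifies that a planted rotation is recovered from anchor pairs.

    import numpy as np

    def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
        """Solve argmin_W ||W X - Y||_F over orthogonal W, where the columns
        of X (source) and Y (target) are the embeddings of the anchor
        entities in E'_T. The solution is W = U V^T, where
        U, Sigma, V^T = SVD(Y X^T)."""
        U, _, Vt = np.linalg.svd(Y @ X.T)
        return U @ Vt

    # Toy check: 300-dimensional embeddings of 5,000 anchor entities.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 5000))
    rotation = np.linalg.qr(rng.standard_normal((300, 300)))[0]  # hidden W
    Y = rotation @ X
    W = procrustes_align(X, Y)
    print(np.allclose(W @ X, Y))  # True: the hidden rotation is recovered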

We adopt the refinement procedure proposed by Conneau et al. (2017) to improve the quality of $\mathbf{W}$. A set of new high-quality anchors is generated to refine the $\mathbf{W}$ learned from $E'_T$. High-quality anchors refer to entities that have high frequency (e.g., the top 5,000) and entities that are mutual nearest neighbors. We iteratively apply this procedure to optimize $\mathbf{W}$: at each iteration, the new high-quality anchors are exploited to learn a new mapping.

Conneau et al. (2017) also propose a novel comparison metric, Cross-domain Similarity Local Scaling (CSLS), to relieve the hubness phenomenon, where some vectors (hubs) are the nearest neighbors of many others. For example, the entity en/United_States is a hub in the vector space. By employing this metric, the similarity of isolated vectors is increased, while the similarity of vectors in dense areas is decreased. Specifically, given a mapped source embedding $\mathbf{W}x$ and a target embedding $y$, the mean cosine similarities of $\mathbf{W}x$ and $y$ with their $K$ nearest neighbors in the other language, $r_T(\mathbf{W}x)$ and $r_S(y)$, are computed respectively. The comparison metric is defined as follows:

$$\mathrm{CSLS}(\mathbf{W}x, y) = 2\cos(\mathbf{W}x, y) - r_T(\mathbf{W}x) - r_S(y)$$

Conneau et al. (2017) show that the performance is essentially the same for K = 5, 10, 50. Following this work, we set K = 10.

3 Downstream Applications

We apply CLEW to enhance two important downstream tasks: Cross-lingual Entity Linking and Parallel Sentence Mining.

3.1 Unsupervised Cross-lingual Entity Linking

Cross-lingual Entity Linking aims to link an entity mention in a source language text to its referent entity in a knowledge base (KB) in a target language (e.g., English Wikipedia). A typical Cross-lingual Entity Linking framework includes three steps: mention translation, entity candidate generation, and mention disambiguation. We use translation dictionaries collected from Wikipedia (Ji et al., 2009) to translate each mention into English; if a mention has multiple translations, we merge the linking results of all translations at the end. We adopt a dictionary-based approach (Medelyan and Legg, 2008) to generate entity candidates for each mention. Then we use CLEW to implement the following two widely used mention disambiguation features: Context Similarity and Coherence.

Context Similarity refers to the similarity between the context of a mention and a candidate entity. Given a mention $m$, we consider the entire sentence containing $m$ as its local context. Using the CLEW embedding $Z$, the vectors of the context words are averaged to obtain the context vector representation of $m$:

$$v_m = \frac{1}{|W_m|} \sum_{w \in W_m} z_w$$

where $W_m$ is the set of context words of $m$, and $z_w \in Z$ is the embedding of the context word $w$. We measure the context similarity between $m$ and each of its entity candidates by the cosine similarity between $v_m$ and the entity embedding $z_e \in Z$ such that:

$$F_{txt}(e) = \cos(v_m, z_e) = \frac{v_m \cdot z_e}{\|v_m\| \, \|z_e\|}$$
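A minimal sketch of this Context Similarity feature, assuming only a dict-like lookup from words and entity titles to their CLEW vectors (the names here are illustrative, not the authors' API):

    import numpy as np

    def context_similarity(context_words, candidate, emb):
        """F_txt(e): cosine between the averaged context-word vectors of a
        mention and a candidate entity vector, both in the CLEW space Z.
        context_words: words of the sentence containing the mention;
        candidate: an entity title such as "en/Apple_Inc.";
        emb: dict from words / entity titles to numpy vectors."""
        known = [emb[w] for w in context_words if w in emb]
        v_m = np.mean(known, axis=0)        # assumes >= 1 known context word
        z_e = emb[candidate]
        return float(v_m @ z_e / (np.linalg.norm(v_m) * np.linalg.norm(z_e)))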

Feature          Description
$F_{prior}(e)$   Entity Prior: $\frac{|A_{e,*}|}{|A_{*,*}|}$, where $A_{e,*}$ is the set of anchor links that link to entity $e$ and $A_{*,*}$ is the set of all anchor links in the KB.
$F_{prob}(e|m)$  Mention to Entity Probability: $\frac{|A_{e,m}|}{|A_{*,m}|}$, where $A_{*,m}$ is the set of anchor links with anchor text $m$ and $A_{e,m}$ is the subset that links to entity $e$.
$F_{type}(e|m,t)$  Entity Type (Ling et al., 2015): $\frac{p(e|m)}{\sum_{e' \mapsto t} p(e'|m)}$, where $e \mapsto t$ indicates that $t$ is one of $e$'s entity types. The conditional probability $p(e|m)$ can be estimated by $F_{prob}(e|m)$.

Table 1: Mention disambiguation features.
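The first two features in Table 1 are plain count ratios over Wikipedia anchor links; a sketch with a toy, invented anchor list:

    from collections import Counter

    # Hypothetical (anchor text, entity) pairs, one per anchor link in the KB.
    anchors = [("apple", "en/Apple_Inc."), ("apple", "en/Apple_Inc."),
               ("apple", "en/Apple"), ("big apple", "en/New_York_City")]

    link_counts = Counter(anchors)                   # |A_{e,m}|
    entity_counts = Counter(e for _, e in anchors)   # |A_{e,*}|
    mention_counts = Counter(m for m, _ in anchors)  # |A_{*,m}|
    total = len(anchors)                             # |A_{*,*}|

    def f_prior(e):
        """Entity prior: the fraction of all anchor links pointing to e."""
        return entity_counts[e] / total

    def f_prob(e, m):
        """Mention-to-entity probability p(e|m)."""
        return link_counts[(m, e)] / mention_counts[m]

    print(f_prior("en/Apple_Inc."))          # 0.5
    print(f_prob("en/Apple_Inc.", "apple"))  # 0.666...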

Coherence is driven by the assumption that if multiple mentions appear together within a context window, their referent entities are more likely to be strongly connected to each other in the KB. Previous work (Cucerzan, 2007; Milne and Witten, 2008; Hoffart et al., 2011; Ratinov et al., 2011; Cheng and Roth, 2013; Ceccarelli et al., 2013; Ling et al., 2015) considers the KB as a knowledge graph and models coherence based on the overlapping neighbors of two entities in the knowledge graph. These approaches heavily rely on explicit connections among entities in the knowledge graph and thus cannot capture the coherence between two entities that are only implicitly connected. For example, the two entities en/Mosquito and en/Cockroach have very few overlapping neighbors in the knowledge graph, but they usually appear together and have similar contexts in text. Using the CLEW embedding $Z$, the coherence score can instead be estimated by the cosine similarity between the embeddings of two entities; this coherence metric pays more attention to semantics. We consider mentions that appear in the same sentence as coherent. Let $m$ be a mention, and $C_e$ be the set of corresponding entity candidates of $m$'s coherent mentions. The coherence score for each of $m$'s entity candidates is the average:

$$F_{coh}(e) = \frac{1}{|C_e|} \sum_{c_e \in C_e} \cos(z_e, z_{c_e})$$

Finally, we linearly combine these two features with several other common mention disambiguation features, as shown in Table 1.

3.2 Parallel Sentence Mining

One major bottleneck of low-resource language machine translation is the lack of parallel sentences. This inspires us to mine parallel sentences from Wikipedia automatically using the CLEW embedding $Z$.

Wikipedia contributors tend to translate some content from existing articles in other languages while editing an article. Therefore, if there exists an inter-language link between two Wikipedia articles in different languages, these two articles can be considered comparable, and thus they are very likely to contain parallel sentences. We represent a Wikipedia sentence in any of the 302 languages by aggregating the embeddings of the entities and words it contains. In order to penalize highly frequent words and entities, we apply a weighted metric:

$$\mathrm{IDF}(t, S) = \log \frac{|S|}{|\{s \in S : t \in s\}|}$$

where $t$ is a term (entity or word), $S$ is an article containing $|S|$ sentences, and $|\{s \in S : t \in s\}|$ is the total number of sentences containing $t$. The embedding of a sentence $v_s$ can then be computed as:

$$v_s = \frac{1}{|T_s|} \sum_{t \in T_s} \mathrm{IDF}(t, S) \cdot z_t$$

where $T_s$ is the set of terms of $s$ and $z_t \in Z$ is the embedding of $t$.

Given two comparable Wikipedia articles connected by an inter-language link, we compute the similarity of all possible sentence pairs using the CSLS metric described in Section 2.2 and rank them. If the CSLS score of a sentence pair is greater than a threshold (in this paper, we empirically set the threshold to 0.1 based on a separate small development set), then the sentence pair is considered parallel. An advantage of our approach is that it provides a similarity score for every term pair, which can be used for improving word alignment and entity alignment.
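The IDF-weighted sentence representation above is straightforward to sketch; as before, emb stands in for a generic lookup into the CLEW space Z rather than any specific API:

    import numpy as np
    from math import log

    def sentence_embedding(sentence_terms, article_sentences, emb):
        """v_s: IDF-weighted average of term (word and entity) embeddings.
        sentence_terms: the terms of sentence s;
        article_sentences: one term list per sentence of the article S
        (s itself is among them, so document frequency is never zero);
        emb: dict from terms to their CLEW vectors z_t."""
        n = len(article_sentences)
        def idf(t):
            df = sum(1 for sent in article_sentences if t in sent)
            return log(n / df)
        weighted = [idf(t) * emb[t] for t in sentence_terms if t in emb]
        return np.mean(weighted, axis=0)

Sentence pairs from two articles connected by an inter-language link are then scored with CSLS over these vectors, as sketched in Section 4.4 below.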

4 Experiments

4.1 Training Data

We use an April 1, 2018 Wikipedia XML dump to generate the data to train the joint entity and word embeddings. We select and analyze only the main Wikipedia pages (ns tag 0) which are not redirects (redirect tag None), using the approach described in Section 2.1. We use the Skip-gram model (Mikolov et al., 2013a,c) to learn the unaligned embeddings. The number of dimensions of the embedding is set to 300, and the minimal number of occurrences, the size of the context window, and the learning rate are set to 5, 5, and 0.025 respectively.

4.2 Linear Mapping

A large number of aligned entities can be obtained using the approach described in Section 2.1. For example, there are about 400,000 aligned entities between English and Spanish. However, the mapping algorithm does not perform well if we try to align all anchors, because the embeddings of rare entities are updated less often, and thus their contexts are very different across languages. Therefore, we learn the global mapping using only high-quality anchors, selecting high-frequency entities as anchors using the salience metric described in Table 1. We use 5,000 anchors for training and 1,500 anchors for testing for each language pair. Our proposed method is applied to 9 language pairs in our experiments. Table 2 shows the statistics and the performance. We can see that mapping a language to a related language (e.g., Ukrainian to Russian) usually achieves better performance.

Source-Target   P@1    P@5    P@10
es-en           79.1   89.2   92.3
it-en           74.5   86.9   90.5
ru-en           68.4   82.8   86.7
tr-en           59.0   79.9   86.3
uk-en           63.0   79.7   85.9
zh-en           63.1   83.8   89.2
uk-ru           78.1   90.3   92.8
ru-uk           75.8   90.2   93.7

Table 2: Linear entity mapping statistics and performance (Precision (%) at K) (en: English, es: Spanish, it: Italian, ru: Russian, so: Somali, tr: Turkish, uk: Ukrainian, zh: Chinese).

4.3 Cross-lingual Entity Linking

We use the training set and evaluation set (LDC2015E75 and LDC2015E103) of the TAC Knowledge Base Population (TAC-KBP) 2015 Tri-lingual Entity Linking Track (Ji et al., 2015) for the cross-lingual entity linking experiments, because these data sets include the most recent and comprehensive gold-standard annotations for this task and allow us to compare our model with previously reported state-of-the-art approaches on the same benchmark.

We first compare our unsupervised approach to the top TAC2015 unsupervised system reported by Ji et al. (2015). In order to have a fair comparison with the state-of-the-art supervised methods, we also combine the features described in Section 3.1 in a point-wise learning-to-rank algorithm based on Gradient Boosted Regression Trees (Friedman, 2000). The learning rate and the maximum depth of the decision trees are set to 0.01 and 4 respectively. The results are shown in Table 3. We can see that our unsupervised and supervised approaches significantly outperform the best TAC15 systems.

Method                     ENG    CMN    SPA
Best TAC15 Unsupervised    67.1   78.1   71.5
Our Unsupervised           70.0   81.2   73.4
  w/o Context Similarity   66.9   79.0   70.6
  w/o Coherence            68.5   78.6   71.4
Best TAC15 Supervised      73.7   83.1   80.4
(Tsai and Roth, 2016)      -      83.6   80.9
(Sil et al., 2017)         -      84.4   82.3
Our Supervised             74.8   84.2   82.1
  w/o Context Similarity   72.2   80.4   79.5
  w/o Coherence            73.3   82.1   77.8

Table 3: F1 (%) on the evaluation set of the TAC-KBP 2015 Tri-lingual Entity Linking Track (Ji et al., 2015) (ENG: English, CMN: Chinese, SPA: Spanish).

We further observe that the Context Similarity and Coherence features derived from $Z$ play significant roles; without them, the performance drops significantly, as shown in Table 3. For example, in the sentence "欧盟委员会副主席雷丁就此表示 ... (European Commission vice president Reding said that ...)", without the Context Similarity feature, the mention "雷丁" (Reding) is likely to be linked to the football club en/Reading_F.C. or the city en/Redding,_California. Using contextual words such as "委员会" (commission) and "主席" (president), we can successfully link this mention to the target entity en/Viviane_Reding.
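The supervised variant above is a point-wise ranker over the features of Section 3.1 and Table 1. The paper does not name its GBRT implementation; the sketch below uses scikit-learn's GradientBoostingRegressor with the reported hyper-parameters, and the feature values are invented for illustration.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # One row per (mention, candidate) pair: e.g. [F_txt, F_coh, F_prior,
    # F_prob, F_type]; the label is 1.0 for the gold entity, else 0.0.
    X_train = np.array([[0.71, 0.42, 0.50, 0.66, 0.90],
                        [0.12, 0.05, 0.01, 0.08, 0.10],
                        [0.33, 0.20, 0.10, 0.25, 0.40]])
    y_train = np.array([1.0, 0.0, 0.0])

    # Hyper-parameters as reported in Section 4.3.
    ranker = GradientBoostingRegressor(learning_rate=0.01, max_depth=4)
    ranker.fit(X_train, y_train)

    def link(candidate_features: np.ndarray) -> int:
        """Return the index of the highest-scoring entity candidate."""
        return int(np.argmax(ranker.predict(candidate_features)))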

4.4 Parallel Sentence Mining

The proposed parallel sentence mining approach can be applied to any two languages in Wikipedia. Therefore, we have mined parallel sentences from a total of $\binom{302}{2}$ language pairs and made this data set publicly available for research purposes. Table 4 shows some examples of mined parallel sentences from Wikipedia, with word and entity alignments highlighted.

Amharic - English
* ዓርብ የሳምንቱ ስድስተኛ ቀን ሲሆን ሐሙስ በኋላ ቅዳሜ በፊት ይገኛል ።
* Friday is the day after Thursday and the day before Saturday .

Yoruba - English
* Glasgow ni ilu totobijulo ni orile-ede Skotlandi ati eyi totobijulo keta ni Britani .
* Glasgow is the largest city in Scotland , and third largest in the United Kingdom .

Uyghur - English
* ﺟﯜﻣ ، ﭘﯾﺸﻧﺒ ﺑﻠن ﺷﻧﺒ ﺋﻮﺗﺘﯘرﺳﺪﻜﻰ ، ھﭙﺘﻨﯔ ﺑﺷﻨﭽﻰ ﻛﯜﻧﺪۇر .
* Friday is the day after Thursday and the day before Saturday .

Vietnamese - English
* Bardolph là một làng thuộc quận McDonough , tiểu bang Illinois , Hoa Kỳ .
* Bardolph is a village in McDonough County , Illinois , United States .

Russian - Ukrainian
* Статья 2 - я Конституции СССР 1977 года провозглашала : « Вся власть в СССР принадлежит народу .
* Стаття 2 - га Конституції СРСР 1977 року проголошувала : " Вся влада в СРСР належить народові .
(Article 2 of the Constitution of the USSR in 1977 proclaimed: "All power in the USSR belongs to the people.")

Classical Chinese - Modern Chinese
* 至二战之时，南斯拉夫屡败，终为德意志、义大利所分。
* 在二次世界大战期间，南斯拉夫多次战败，分别被德国、意大利占领。
(During World War II, Yugoslavia was defeated several times and was occupied by Germany and Italy.)

Table 4: Examples of mined parallel sentences from Wikipedia. A portion of the alignments are highlighted using the same colors.

We randomly select 100 mined parallel sentence pairs for each of 3 language pairs, and ask linguistic experts to judge the quality of these sentence pairs (perfect, partial, or not parallel). The results are shown in Table 5. We can see that the quality of the mined parallel sentences is promising and the quality of the word and entity alignments is decent.

Language Pairs       Perfect  Partial  Word    Entity
Chinese-English      81%      10%      92.3%   95.5%
Spanish-English      75%      13%      89.7%   91.1%
Russian-Ukrainian    70%      16%      82.4%   90.3%

Table 5: Quality of the mined parallel sentences (Perfect and Partial stand for the percentage of perfect and partial sentence pairs respectively; Word and Entity stand for the accuracy of word and entity alignments respectively).

Furthermore, we evaluate the quality of the mined parallel sentences extrinsically using a neural machine translation (NMT) model. We use the Transformer model (Vaswani et al., 2017) implemented in Tensor2Tensor.[5] Our Transformer model has 6 encoder and decoder layers, 8 attention heads, 512-dimension hidden states, 2048-dimension feed-forward layers, dropout of 0.1 and label smoothing of 0.1. The model is trained for up to 128,000 optimizer steps.

Using the NMT model as a black box, we perform two experiments with the following training and tuning settings:

• Baseline: 44,000 training and 1,000 tuning sentences randomly sampled from the WMT17 News Commentary v12 Russian-English Corpus (Bojar et al., 2016).

• Our approach: adding 44,000 training and 1,000 tuning sentences mined from Wikipedia using CLEW.

Using 1,000 randomly selected sentences from the WMT17 corpus for testing, the baseline achieves a 19.0% BLEU score while our approach achieves a 20.8% BLEU score.

[5] https://github.com/tensorflow/tensor2tensor
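Putting Sections 2.2 and 3.2 together, the mining step reduces to scoring all sentence pairs of two comparable articles with CSLS and thresholding at 0.1. Below is a vectorized sketch (not the released code); sentence vectors are assumed to be built as in Section 3.2, with source vectors already mapped by W.

    import numpy as np

    def mine_parallel(src_vecs, tgt_vecs, threshold=0.1, k=10):
        """Score every sentence pair with CSLS and keep pairs above the
        threshold. src_vecs: mapped source-language sentence vectors;
        tgt_vecs: target-language sentence vectors."""
        X = np.array(src_vecs, dtype=float)
        Y = np.array(tgt_vecs, dtype=float)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        Y /= np.linalg.norm(Y, axis=1, keepdims=True)
        sim = X @ Y.T                        # cosine similarity matrix
        kx, ky = min(k, sim.shape[1]), min(k, sim.shape[0])
        r_t = np.sort(sim, axis=1)[:, -kx:].mean(axis=1)  # r_T per source
        r_s = np.sort(sim, axis=0)[-ky:, :].mean(axis=0)  # r_S per target
        csls = 2 * sim - r_t[:, None] - r_s[None, :]
        i, j = np.where(csls > threshold)
        pairs = zip(i.tolist(), j.tolist(), csls[i, j].tolist())
        return sorted(pairs, key=lambda p: -p[2])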

5 Related Work

Cross-lingual Word Embedding Learning. Mikolov et al. (2013b) first notice that word embedding spaces have similar geometric arrangements across languages, and use this property to learn a linear mapping between two spaces. After that, several methods attempt to improve the mapping (Faruqui and Dyer, 2014; Xing et al., 2015; Lazaridou et al., 2015; Ammar et al., 2016; Artetxe et al., 2017; Smith et al., 2017). The measures used to compute the similarity between a foreign word and an English word often include distributed monolingual representations at the character level (Costa-jussà and Fonollosa, 2016; Luong and Manning, 2016), the subword level (Anwarus Salam et al., 2012; Rei et al., 2016; Sennrich et al., 2016; Yang et al., 2017), and bilingual word embeddings (Madhyastha and España-Bonet, 2017). Recent attempts have shown that it is possible to derive cross-lingual word embeddings from unaligned corpora in an unsupervised fashion (Zhang et al., 2017; Conneau et al., 2017; Artetxe et al., 2018). Another strategy for cross-lingual word embedding learning is to combine monolingual and cross-lingual training objectives (Zou et al., 2013; Klementiev et al., 2012; Luong et al., 2015; Ammar et al., 2016; Vulić et al., 2017). Compared to our direct mapping approach, these methods generally require a large amount of parallel data.

Our work is largely inspired by Conneau et al. (2017). However, our work focuses on better representing entities, which are fundamentally different from common words or phrases in many aspects, as described in Section 1. Previous multilingual word embedding efforts, including Conneau et al. (2017), do not explicitly handle entity representations. Moreover, we perform comprehensive extrinsic evaluations based on downstream NLP applications including cross-lingual entity linking and machine translation, while previous work on cross-lingual embedding only focused on intrinsic evaluations.

Cross-lingual Joint Entity and Word Embedding Learning. Previous cross-lingual joint entity and word embedding methods largely neglect unlinkable entities (Tsai and Roth, 2016) or heavily rely on parallel or comparable sentences (Cao et al., 2018). Tsai and Roth (2016) apply a similar approach to generate code-switched data from Wikipedia, but their framework does not keep entities in the source language. Using all aligned entities as a dictionary, they adopt canonical correlation analysis to project two embedding spaces into one. In contrast, we choose only salient entities as anchors to learn a linear mapping. Cao et al. (2018) generate comparable data via distant supervision over multilingual knowledge bases, and use an entity regularizer and a sentence regularizer to align cross-lingual words and entities; further, they design knowledge attention and cross-lingual attention to refine the alignment. Essentially, they train cross-lingual embeddings jointly, while we align two embedding spaces that are trained independently. Moreover, compared to their approach, which relies on comparable data, aligned entities are easier to acquire.

Parallel Sentence Mining. Automatically mining parallel sentences from comparable documents is an important and useful task for improving Statistical Machine Translation. Early efforts mainly exploited bilingual word dictionaries for bootstrapping (Fung and Cheung, 2004). Recent approaches are mainly based on bilingual word embeddings (Marie and Fujita, 2017) and sentence embeddings (Schwenk, 2018) to detect sentence pairs or continuous parallel segments (Hangya and Fraser, 2019). To the best of our knowledge, this is the first work to incorporate joint entity and word embedding into parallel sentence mining. As a result, the sentence pairs we mine include reliable alignments between entity mentions, which are often out-of-vocabulary and ambiguous and thus receive poor alignment quality from previous methods.

6 Conclusions and Future Work

We developed a simple yet effective framework to learn cross-lingual joint entity and word embeddings based on rich anchor links in Wikipedia. The learned embeddings strongly enhance two downstream applications: cross-lingual entity linking and parallel sentence mining. The results demonstrate that our proposed method advances the state of the art for the unsupervised cross-lingual entity linking task. We have also constructed a valuable repository of parallel sentences for all language pairs in Wikipedia to share with the community. In the future, we will extend the framework to better represent other types of knowledge elements such as relations and events.

Acknowledgments

This research is based upon work supported in part by U.S. DARPA LORELEI Program HR0011-15-C-0115, the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract FA8650-17-C-9116, and ARL NS-CTA No. W911NF-09-2-0053. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. CoRR, abs/1602.01925.

Khan Md. Anwarus Salam, Setsuo Yamada, and Tetsuro Nishino. 2012. Sublexical translations for low-resource language. In Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages, pages 39–52, Mumbai, India. The COLING 2012 Organizing Committee.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462. Association for Computational Linguistics.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798. Association for Computational Linguistics.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735, Berlin, Heidelberg. Springer Berlin Heidelberg.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198. Association for Computational Linguistics.

Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Chengjiang Li, Xu Chen, and Tiansi Dong. 2018. Joint representation learning of cross-lingual words and entities via attentive distant supervision. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 227–237. Association for Computational Linguistics.

Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. 2013. Learning relatedness measures for entity linking. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 139–148, New York, NY, USA. ACM.

Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Seattle, Washington, USA. Association for Computational Linguistics.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Marta R. Costa-jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 357–361, Berlin, Germany. Association for Computational Linguistics.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716, Prague, Czech Republic. Association for Computational Linguistics.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471. Association for Computational Linguistics.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 30–35, Berlin, Germany. Association for Computational Linguistics.

Jerome H. Friedman. 2000. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232.

Pascale Fung and Percy Cheung. 2004. Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 57–63, Barcelona, Spain. Association for Computational Linguistics.

Viktor Hangya and Alexander Fraser. 2019. Unsupervised parallel sentence extraction with parallel segment detection helps machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1224–1234, Florence, Italy. Association for Computational Linguistics.

Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.

Heng Ji, Ralph Grishman, Dayne Freitag, Matthias Blume, John Wang, Shahram Khadivi, Richard Zens, and Hermann Ney. 2009. Name extraction and translation for distillation. Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation.

Heng Ji, Joel Nothman, Ben Hachey, and Radu Florian. 2015. Overview of TAC-KBP2015 tri-lingual entity discovery and linking. In Proceedings of the Text Analysis Conference (TAC2015).

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474. The COLING 2012 Organizing Committee.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 270–280. Association for Computational Linguistics.

Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. Design challenges for entity linking. Transactions of the Association for Computational Linguistics, 3:315–328.

Minh-Thang Luong and Christopher Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of ACL 2016.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159. Association for Computational Linguistics.

Pranava Swaroop Madhyastha and Cristina España-Bonet. 2017. Learning bilingual projections of embeddings for vocabulary expansion in machine translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 139–145, Vancouver, Canada. Association for Computational Linguistics.

Benjamin Marie and Atsushi Fujita. 2017. Efficient extraction of pseudo-parallel sentences from raw monolingual data using word embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 392–398, Vancouver, Canada. Association for Computational Linguistics.

O. Medelyan and C. Legg. 2008. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. CoRR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26.

D. Milne and I. H. Witten. 2008. Learning to link with Wikipedia. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM 2008).

Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1375–1384, Portland, Oregon, USA. Association for Computational Linguistics.

Marek Rei, Gamal Crichton, and Sampo Pyysalo. 2016. Attending to characters in neural sequence labeling models. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 309–318, Osaka, Japan. The COLING 2016 Organizing Committee.

Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL 2016.

Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2017. Neural cross-lingual entity linking. CoRR, abs/1712.01813.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. CoRR, abs/1702.03859.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598, San Diego, California. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ivan Vulić, Nikola Mrkšić, and Anna Korhonen. 2017. Cross-lingual induction and transfer of verb classes based on word vector space specialisation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2546–2558. Association for Computational Linguistics.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1591–1601. Association for Computational Linguistics.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011. Association for Computational Linguistics.

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259, Berlin, Germany. Association for Computational Linguistics.

Baosong Yang, Derek F. Wong, Tong Xiao, Lidia S. Chao, and Jingbo Zhu. 2017. Towards bidirectional hierarchical representations for attention-based neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1432–1441, Copenhagen, Denmark. Association for Computational Linguistics.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1959–1970. Association for Computational Linguistics.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398. Association for Computational Linguistics.
