Learning to Represent Bilingual Dictionaries
Muhao Chen¹∗, Yingtao Tian²∗, Haochen Chen², Kai-Wei Chang¹, Steven Skiena² & Carlo Zaniolo¹
¹University of California, Los Angeles, CA, USA
²The State University of New York, Stony Brook, NY, USA
[email protected]; {kwchang, zaniolo}@cs.ucla.edu; {yittian, haocchen, skiena}@cs.stonybrook.edu
arXiv:1808.03726v3 [cs.CL] 6 Sep 2019

∗ Both authors contributed equally to this work.

Abstract

Bilingual word embeddings have been widely used to capture the correspondence of lexical semantics in different human languages. However, the cross-lingual correspondence between sentences and words is less studied, even though this correspondence can significantly benefit many applications such as cross-lingual semantic search and textual inference. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionaries¹. The proposed model is trained to map the lexical definitions to the cross-lingual target words, for which we explore different sentence encoding techniques. To enhance the learning process on limited resources, our model adopts several critical learning strategies, including multi-task learning on different bridges of languages, and joint learning of the dictionary model with a bilingual word embedding model. We conduct experiments on two new tasks. In the cross-lingual reverse dictionary retrieval task, we demonstrate that our model is capable of comprehending bilingual concepts based on descriptions, and that the proposed learning strategies are effective. In the bilingual paraphrase identification task, we show that our model effectively associates sentences in different languages via a shared embedding space, and outperforms existing approaches in identifying bilingual paraphrases.

¹ We use the term dictionary in its common meaning, i.e. lexical definitions of words. Note that this differs from some papers on bilingual settings that use dictionaries to mean seed lexicons for one-to-one word mappings.

1 Introduction

Cross-lingual semantic representation learning has attracted significant attention recently. Various approaches have been proposed to align words of different languages in a shared embedding space (Ruder et al., 2017). By offering task-invariant semantic transfers, these approaches critically support many cross-lingual NLP tasks including neural machine translation (NMT) (Devlin et al., 2014), bilingual document classification (Zhou et al., 2016), knowledge alignment (Chen et al., 2018b) and entity linking (Upadhyay et al., 2018).

While many existing approaches have been proposed to associate lexical semantics between languages (Chandar et al., 2014; Gouws et al., 2015; Luong et al., 2015a), modeling the correspondence between lexical and sentential semantics across different languages is still an unresolved challenge. We argue that learning to represent such cross-lingual and multi-granular correspondence is both desirable and natural for multiple reasons. One reason is that learning word-to-word correspondence has a natural limitation, considering that many words do not have direct translations in another language. For example, schadenfreude in German, which means a feeling of joy that comes from knowing the troubles of other people, has no proper English counterpart word. To appropriately learn the representations of such words in bilingual embeddings, we need to capture their meanings based on the definitions.

Besides, modeling such correspondence is also highly beneficial to many application scenarios. One example is cross-lingual semantic search of concepts (Hill et al., 2016), where the lexemes or concepts are retrieved based on sentential descriptions (see Fig. 1). Others include discourse relation detection in bilingual dialogue utterances (Jiang et al., 2018), multilingual text summarization (Nenkova et al., 2012), and educational applications for foreign language learners. Finally, it is natural in foreign language learning that a human learns foreign words by looking up their meanings in the native language (Hulstijn et al., 1996). Therefore, learning such correspondence essentially mimics human learning behaviors.

[Figure 1 content: the English description "A male descendent in relation to his parents." (EN) and the French description "Tout être humain du sexe masculin considéré par rapport à son père et à sa mère, ou à un des deux seulement." (FR) are cross-lingual paraphrases. Cross-lingual reverse dictionary retrieval maps the English description to the French word Fils (FR) and the French description to the English word Son (EN).]

Figure 1: An example illustrating the two cross-lingual tasks. The cross-lingual reverse dictionary retrieval finds cross-lingual target words based on descriptions. In terms of cross-lingual paraphrases, the French sentence (which means any male being considered in relation to his father and mother, or only one of them) describes the same meaning as the English sentence, but contains many more content details.

However, realizing such a representation learning model is a non-trivial task, inasmuch as it requires a comprehensive learning process to effectively compose the semantics of arbitrary-length sentences in one language, and associate that with single words in another language. Consequently, this objective also demands high-quality cross-lingual alignment that bridges between single words and sequences of words. Such alignment information is generally not available in the parallel corpora and seed lexicons that are utilized by bilingual word embeddings (Ruder et al., 2017).

To incorporate the representations of bilingual lexical and sentential semantics, we propose an approach to capture the mapping from the definitions to the corresponding foreign words by leveraging bilingual dictionaries. The proposed model BilDRL (Bilingual Dictionary Representation Learning) first constructs a word embedding space with pre-trained bilingual word embeddings. Based on cross-lingual word definitions, a sentence encoder is trained to realize the mapping from literal descriptions to target words in the bilingual word embedding space, for which we investigate multiple encoding techniques. To enhance cross-lingual learning on limited resources, BilDRL conducts multi-task learning on different directions of a language pair. Moreover, BilDRL enforces a joint learning strategy of bilingual word embeddings and the sentence encoder, which seeks to gradually adjust the embedding space to better suit the representation of cross-lingual word definitions.

To show the applicability of BilDRL, we conduct experiments on two useful cross-lingual tasks (see Fig. 1). (i) Cross-lingual reverse dictionary retrieval seeks to retrieve words or concepts given descriptions in another language. This task is useful to help users find foreign words based on notions or descriptions, and is especially beneficial to users such as translators, foreign language learners and technical writers using non-native languages. We show that BilDRL achieves promising results on this task, while bilingual multi-task learning and joint learning dramatically enhance the performance. (ii) Bilingual paraphrase identification asks whether two sentences in different languages essentially express the same meaning, which is critical to question answering or dialogue systems that apprehend multilingual utterances (Bannard and Callison-Burch, 2005). This task is challenging, as it requires a model to comprehend cross-lingual paraphrases that are inconsistent in grammar, content details and word orders. BilDRL maps sentences to the lexicon embedding space. This process reduces the problem to evaluating the similarity of lexicon embeddings, which can be easily solved by a simple classifier. BilDRL performs well with even a small amount of data, and significantly outperforms previous approaches.

2 Related Work

We discuss two lines of relevant work.

Bilingual word embeddings. Various approaches have been proposed for training bilingual word embeddings. These approaches fall into two families: off-line mappings and joint training.

The off-line mapping based approach fixes the structures of pre-trained monolingual embeddings, and induces bilingual projections based on seed lexicons (Mikolov et al., 2013a). Some variants of this approach improve the quality of projections by adding constraints such as orthogonality of transforms, normalization and mean centering of embeddings (Xing et al., 2015; Artetxe et al., 2016; Vulić et al., 2016). Others adopt canonical correlation analysis to map separate monolingual embeddings to a shared embedding space (Faruqui and Dyer, 2014; Doval et al., 2018).

Unlike off-line mappings, joint training models simultaneously update word embeddings and cross-lingual alignment. In doing so, such approaches generally capture more precise cross-lingual semantic transfer (Ruder et al., 2017; Upadhyay et al., 2018). While a few such models still maintain separated embedding spaces for each language (Artetxe et al., 2017), more of them maintain a unified space for both languages. The cross-lingual semantic transfer by these models is captured from parallel corpora with sentential or document-level alignment, using techniques such as bilingual bag-of-words distances (BilBOWA) (Gouws et al., 2015), Skip-Gram (Coulmance et al., 2015) and sparse tensor factorization (Vyas and Carpuat, 2016).

Neural sentence modeling. Neural sentence models seek to capture phrasal or sentential semantics from word sequences. They often adopt encoding techniques such as recurrent neural encoders (RNN) (Kiros et al., 2015), convolutional encoders (CNN) (Chen et al., 2018a), and attentive encoders (Rocktäschel et al., 2016) to represent the composed semantics of a sentence as an embedding vector. Recent works have focused on apprehending pairwise correspondence of sentential semantics by adopting multiple neural sentence models in one learning architecture, including Siamese models for detecting discourse relations of sentences (Sha et al., 2016), and sequence-to-sequence models for tasks like style transfer (Shen et al., 2017), text summarization (Chopra et al., 2016) and translation (Wu et al., 2016).

On the other hand, less effort has been devoted to characterizing the associations between sentential and lexical semantics. Hill et al. (2016) and Ji et al. (2017) learn off-line mappings between monolingual descriptions and lexemes to capture such associations. Eisner et al. (2016) adopt a similar approach to capture emojis based on descriptions. To the best of our knowledge, there has been no previous approach that learns to discover the correspondence of sentential and lexical semantics in a multilingual scenario. This is exactly the focus of our work, in which the proposed strategies of multi-task learning and joint learning are critical to the corresponding learning process under limited resources. Utilizing such correspondence, our approach also sheds light on addressing discourse relation detection in a multilingual scenario.

3 Modeling Bilingual Dictionaries

We begin our modeling with the formalization of bilingual dictionaries. We use $L$ to denote the set of languages. For a language $l \in L$, $V_l$ denotes its vocabulary, where for each word $w \in V_l$, bold-faced $\mathbf{w} \in \mathbb{R}^k$ denotes its embedding vector. An $l_i$-$l_j$ bilingual dictionary $D(l_i, l_j)$ (or simply $D_{ij}$) contains dictionary entries $(w^i, S_w^j) \in D_{ij}$, in which $w^i \in V_{l_i}$, and $S_w^j = w_1^j \ldots w_n^j$ ($w_\cdot^j \in V_{l_j}$) is a cross-lingual definition that describes the word $w^i$ with a sequence of words in language $l_j$. For example, a French-English dictionary $D(\mathrm{Fr}, \mathrm{En})$ could include a French word appétite accompanied by its English definition "desire for, or relish of food or drink". Note that, for a word $w^i$, multiple definitions in $l_j$ may coexist.

BilDRL is constructed and improved through three stages, as depicted in Fig. 2. A sentence encoder is first used to learn from a bilingual dictionary the association between words and definitions. Then in a pre-trained bilingual word embedding space, multi-task learning is conducted on both directions of a language pair. Lastly, joint learning with word embeddings is enforced to simultaneously adjust the embedding space during the training of the dictionary model, which further enhances the cross-lingual learning process.

[Figure 2 panels: MTL Dictionary Model; Bilingual BilBOWA Model; English/French Monolingual Skip-Gram Model. Legend: the same color indicates shared parameters.]

Figure 2: Joint learning architecture of BilDRL.

It is noteworthy that NMT (Wu et al., 2016) is an ostensibly relevant method to ours. NMT does not apply to our problem setting because it differs from our work in two major respects: (i) In terms of data modalities, NMT has to bridge between corpora of the same granularity, i.e. either between sentences or between lexicons. This is unlike BilDRL, which captures multi-granular correspondence of semantics across different modalities, i.e. sentences and words; (ii) As for learning strategies, NMT relies on an encoder-decoder architecture using end-to-end training (Luong et al., 2015b), while BilDRL employs joint learning of a dictionary-based sentence encoder and a bilingual embedding space.

3.1 Encoders for Lexical Definitions

BilDRL models a dictionary using a neural sentence encoder $E(S)$, which composes the meaning of the sentence into a latent vector representation. We introduce this model component here, which is designed to be a GRU encoder with self-attention. Besides that, we also investigate other widely-used neural sequence encoders.

3.1.1 Attentive GRU Encoder

The GRU encoder is a computationally efficient alternative to the LSTM (Cho et al., 2014). Each unit consists of a reset gate $r_t$ and an update gate $z_t$ to track the state of the sequence. Given the vector representation $\mathbf{w}_t$ of an incoming item $w_t$, GRU updates the hidden state $h_t^{(1)}$ as a linear combination of the previous state $h_{t-1}^{(1)}$ and the candidate state $\tilde{h}_t^{(1)}$ of the new item $w_t$ as below.
$$h_t^{(1)} = z_t \odot \tilde{h}_t^{(1)} + (1 - z_t) \odot h_{t-1}^{(1)}$$

The update gate $z_t$ balances between the information of the previous sequence and the new item, where $M_z$ and $N_z$ are two weight matrices, $b_z$ is a bias vector, and $\sigma$ is the sigmoid function.

$$z_t = \sigma\left(M_z \mathbf{w}_t + N_z h_{t-1}^{(1)} + b_z\right)$$

The candidate state $\tilde{h}_t^{(1)}$ is calculated similarly to that in a traditional recurrent unit, as below. The reset gate $r_t$ controls how much information of the past sequence should contribute to the candidate state.

$$\tilde{h}_t^{(1)} = \tanh\left(M_s \mathbf{w}_t + r_t \odot (N_s h_{t-1}^{(1)}) + b_s\right)$$

$$r_t = \sigma\left(M_r \mathbf{w}_t + N_r h_{t-1}^{(1)} + b_r\right)$$

A GRU encoder can stack multiple of the above GRU layers. Without an attention mechanism, the last state $h_S^{(1)}$ of the last layer represents the overall meaning of the encoded sentence $S$. The self-attention mechanism (Conneau et al., 2017) seeks to highlight the important units in an input sentence when capturing its overall meaning, and is calculated as below:

$$u_t = \tanh\left(M_a h_t^{(1)} + b_a\right)$$

$$a_t = \frac{\exp\left(u_t^\top u_S\right)}{\sum_{w_m \in S} \exp\left(u_m^\top u_S\right)}$$

$$h_t^{(2)} = |S|\, a_t u_t$$

$u_t$ is the intermediary representation of the GRU output $h_t^{(1)}$, and $u_S = \tanh(M_a h_S^{(1)} + b_a)$ is that of the last GRU output $h_S^{(1)}$. $u_S$ can be seen as a high-level representation of the input sequence. By measuring the similarity of each $u_t$ with $u_S$, the normalized attention weight $a_t$, which highlights an input that contributes significantly to the overall meaning, is produced through a softmax. Note that a scalar $|S|$ is multiplied along with $a_t$ to $u_t$, so as to keep the weighted representation $h_t^{(2)}$ from losing the scale of $h_t^{(1)}$. The sentence encoding is calculated as the average of the last attention layer:

$$E^{(1)}(S) = \frac{1}{|S|} \sum_{t=1}^{|S|} a_t h_t^{(2)}$$

3.1.2 Other Encoders

We also experiment with other widely used neural sentence modeling techniques², which are however outperformed by the attentive GRU in our tasks. These techniques include the vanilla GRU, CNN (Kalchbrenner et al., 2014), and linear bag-of-words (BOW) (Hill et al., 2016). We briefly introduce the latter two techniques in the following.

A convolutional encoder applies a kernel $M_c \in \mathbb{R}^{h \times k}$ to produce the latent representation $h_t^{(3)} = \tanh(M_c \mathbf{w}_{t:t+h-1} + b_c)$ from each $h$-gram of the input vector sequence $\mathbf{w}_{t:t+h-1}$, for which $h$ is the kernel size and $b_c$ is a bias vector. A sequence of latent vectors $H^{(3)} = [h_1^{(3)}, h_2^{(3)}, \ldots, h_{|S|-h+1}^{(3)}]$ is produced from the input, where each latent vector leverages the significant local semantic features from each $h$-gram. Following convention (Liu et al., 2017), we apply dynamic max-pooling to extract robust features from the convolution outputs, and use the mean-pooling results of the last layer to represent the sentential semantics.

The linear bag-of-words (BOW) encoder (Ji et al., 2017; Hill et al., 2016) is realized by the sum of projected word embeddings of the input sentence, i.e. $E^{(2)}(S) = \sum_{t=1}^{|S|} M_b \mathbf{w}_t$.

² Note that recent advances in monolingual contextualized embeddings like multilingual ELMo (Peters et al., 2018; Che et al., 2018) and M-BERT (Pires et al., 2019; Devlin et al., 2018) can also be supported to represent sentences for our setting. We leave them as future work, as they require non-trivial adaptation to both multilingual settings and joint training, and extensive pre-training on external corpora.

3.2 Basic Learning Objective

The objective of learning the dictionary model is to map the encodings of cross-lingual word definitions to the target word embeddings. This is realized by minimizing the following $L_2$ loss,

$$L_{ij}^{ST} = \frac{1}{|D_{ij}|} \sum_{(w^i, S_w^j) \in D_{ij}} \left\| E_{ij}(S_w^j) - \mathbf{w}^i \right\|_2^2$$

in which $E_{ij}$ is the dictionary model that maps from descriptions in $l_j$ to words in $l_i$.

The above defines the basic model variants of BilDRL that learn on a single dictionary. For word representations in the learning process, BilDRL initializes the embedding space using pre-trained word embeddings. Note that, without adopting the joint learning strategy in Section 3.4, the learning process does not update the word embeddings that are used to represent the definitions and target words. While other forms of loss such as cosine proximity (Hill et al., 2016) and hinge loss (Ji et al., 2017) may also be used in the learning process, we find that the $L_2$ loss consistently leads to better performance in our experiments.

3.3 Bilingual Multi-task Learning

In cases where entries in a bilingual dictionary are not amply provided, learning the above bilingual dictionary model on one ordered language pair may suffer from insufficient alignment information. One practical solution is to conduct a bilingual multi-task learning process. In detail, given a language pair $(l_i, l_j)$, we learn the dictionary model $E_{ij}$ on both dictionaries $D_{ij}$ and $D_{ji}$ with shared parameters. Correspondingly, we rewrite the previous learning objective function as below, in which $D = D_{ij} \cup D_{ji}$.

$$L_{ij}^{MT} = \frac{1}{|D|} \sum_{(w, S_w) \in D} \left\| E_{ij}(S_w) - \mathbf{w} \right\|_2^2$$

This strategy non-trivially requires the same dictionary model to represent semantic transfer in both directions of the language pair. To fulfill this requirement, we initialize the embedding space using the BilBOWA embeddings (Gouws et al., 2015), which provide a unified embedding space that resolves both monolingual and cross-lingual semantic relatedness of words. In practice, we find that this simple multi-task strategy brings significant improvement to our cross-lingual tasks. Note that, besides BilBOWA, other joint-training bilingual embeddings in a unified space (Doval et al., 2018) can also support this strategy, for which we leave the comparison to future work.

3.4 Joint Learning Objective

While the above learning strategies are based on a fixed embedding space, we lastly propose a joint learning strategy. During the training process, this strategy simultaneously updates the embedding space based on both the dictionary model and the bilingual word embedding model. The learning is through asynchronous minimization of the following joint objective function,

$$J = L_{ij}^{MT} + \lambda_1 \left(L_i^{SG} + L_j^{SG}\right) + \lambda_2 \Omega_{ij}^{A}$$

where $\lambda_1$ and $\lambda_2$ are two positive hyperparameters. $L_i^{SG}$ and $L_j^{SG}$ are the original Skip-Gram losses (Mikolov et al., 2013b) to separately obtain word embeddings on monolingual corpora of $l_i$ and $l_j$. $\Omega_{ij}^{A}$, defined below, is the alignment loss to minimize bag-of-words distances for aligned sentence pairs $(S^i, S^j)$ in parallel corpora $C_{ij}$.

$$\Omega_{ij}^{A} = \frac{1}{|C_{ij}|} \sum_{(S^i, S^j) \in C_{ij}} d_S(S^i, S^j)$$

$$d_S(S^i, S^j) = \left\| \frac{1}{|S^i|} \sum_{w_m^i \in S^i} \mathbf{w}_m^i - \frac{1}{|S^j|} \sum_{w_n^j \in S^j} \mathbf{w}_n^j \right\|_2^2$$

The joint learning process adapts the embedding space to better suit the dictionary model, which is shown to further enhance the cross-lingual learning of BilDRL.

3.5 Training

To initialize the embedding space, we pre-train BilBOWA on the parallel corpora Europarl v7 (Koehn, 2005) and monolingual corpora of tokenized Wikipedia dump (Al-Rfou et al., 2013). For models without joint learning, we use AMSGrad (Reddi et al., 2018) to optimize the parameters. Each model without bilingual multi-task learning is trained on batched samples from an individual dictionary. Multi-task learning models are trained on batched samples from two dictionaries. Within each batch, entries of different directions of languages can be mixed together. For joint learning, we conduct an efficient multi-threaded asynchronous training (Mnih et al., 2016) of AMSGrad. In detail, after initializing the embedding space based on pre-trained BilBOWA, parameter updating based on the four components of $J$ occurs across four worker threads. Two monolingual threads select batches of monolingual contexts from the Wikipedia dump of two languages for Skip-Gram, one alignment thread randomly samples parallel sentences from Europarl, and one dictionary thread extracts samples of entries for a bilingual multi-task dictionary model. Each thread makes a batched update to model parameters asynchronously for each term of $J$. The asynchronous training of all threads keeps going until the dictionary thread finishes its epochs.

Dictionary       En-Fr    Fr-En    En-Es    Es-En
#Target words    15,666   16,857    8,004   16,986
#Definitions     50,412   58,808   20,930   56,610

Table 1: Statistics of the bilingual dictionary dataset Wikt3l.

Positive Examples
  En: Being remote in space.
  Fr: Se trouvant à une grande distance.
  En: The interdisciplinary science that applies theories and methods of the physical sciences to questions of biology.
  Es: Ciencia que emplea y desarrolla las teorías y métodos de la física en la investigación de los sistemas biológicos.
Negative Examples
  En: A person who secedes or supports secession from a political union.
  Fr: Contrôle politique exercé par une grande puissance sur une contrée inféodée.
  En: The fear of closed, tight places.
  Es: Pérdida o disminución considerables de la memoria.

Table 2: Examples of bilingual paraphrases from WBP3l.
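To make the components of Section 3 concrete, the following is a minimal NumPy sketch of the attention pooling of Section 3.1.1, the L2 dictionary loss of Sections 3.2 and 3.3, the bag-of-words distance $d_S$, and the joint objective $J$ of Section 3.4. It is an illustration under stated assumptions, not the authors' implementation: the function names (`attentive_pooling`, `dictionary_l2_loss`, `bow_distance`, `joint_objective`) are hypothetical, the GRU outputs are assumed to be given as a matrix, the Skip-Gram losses are passed in as precomputed scalars, and the default values of the two coefficients follow the setting reported in Section 4.2.

```python
import numpy as np


def attentive_pooling(H, M_a, b_a):
    """Self-attention over GRU outputs (hypothetical helper).

    H   : array of shape (|S|, d), rows are the GRU outputs h_t^(1)
    M_a : array of shape (d_a, d), attention weight matrix
    b_a : array of shape (d_a,), attention bias
    Returns E^(1)(S) = (1/|S|) * sum_t a_t * h_t^(2).
    """
    n = H.shape[0]
    U = np.tanh(H @ M_a.T + b_a)      # u_t = tanh(M_a h_t^(1) + b_a)
    u_S = U[-1]                       # u_S comes from the last GRU output
    scores = U @ u_S                  # u_t^T u_S for every position t
    scores = scores - scores.max()    # shift for numerical stability
    a = np.exp(scores) / np.exp(scores).sum()   # softmax weights a_t
    H2 = n * a[:, None] * U           # h_t^(2) = |S| a_t u_t
    return (a[:, None] * H2).sum(axis=0) / n


def dictionary_l2_loss(encodings, targets):
    """Mean squared L2 distance between definition encodings and target
    word embeddings, as in L^ST (one dictionary) or L^MT (union of two)."""
    diff = np.asarray(encodings, dtype=float) - np.asarray(targets, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def bow_distance(S_i, S_j):
    """d_S: squared L2 distance between the mean word vectors of two
    aligned sentences, each given as an array of shape (length, k)."""
    mean_i = np.asarray(S_i, dtype=float).mean(axis=0)
    mean_j = np.asarray(S_j, dtype=float).mean(axis=0)
    return float(np.sum((mean_i - mean_j) ** 2))


def joint_objective(l_mt, l_sg_i, l_sg_j, omega, lam1=0.1, lam2=0.1):
    """J = L^MT + lam1 * (L^SG_i + L^SG_j) + lam2 * Omega^A."""
    return l_mt + lam1 * (l_sg_i + l_sg_j) + lam2 * omega
```

In the paper's training procedure the four terms of $J$ are minimized asynchronously by separate threads; the sketch only shows how the terms combine into a single value.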
4 Experiments

We present experiments on two multilingual tasks: the cross-lingual reverse dictionary retrieval task and the bilingual paraphrase identification task.

4.1 Datasets

The experiment of cross-lingual reverse dictionary retrieval is conducted on a trilingual dataset Wikt3l. This dataset is extracted from Wiktionary³, which is one of the largest freely available multilingual dictionary resources on the Web. Wikt3l contains dictionary entries of language pairs (English, French) and (English, Spanish), which form En-Fr, Fr-En, En-Es and Es-En dictionaries on four bridges of languages in total. Two types of cross-lingual definitions are extracted from Wiktionary: (i) cross-lingual definitions provided under the Translations sections of Wiktionary pages; (ii) monolingual definitions for words that are linked to a cross-lingual counterpart with an inter-language link⁴ of Wiktionary. We exclude all the definitions of stop words in constructing the dataset, and list the statistics in Table 1.

Since existing datasets for paraphrase identification are merely monolingual, we contribute another dataset, WBP3l, for cross-lingual sentential paraphrase identification. This dataset contains 6,000 bilingual sentence pairs for each of the En-Fr and En-Es settings. Within each bilingual setting, positive cases are formed as pairs of descriptions aligned by inter-language links, which exclude the word descriptions in Wikt3l for training BilDRL. To generate negative examples, given a source word, we first find its 15 nearest neighbors in the embedding space. Within the nearest neighbors, we use ConceptNet (Speer et al., 2017) to filter out the synonyms of the source word, so as to avoid generating false negative cases. Then we randomly pick one word from the filtered neighbors and pair its cross-lingual definition with the English definition of the source word to create a negative case. This process ensures that each negative case is endowed with limited dissimilarity of sentence meanings, which makes the decision more challenging. For each language setting, we randomly select 70% for training, 5% for validation, and the remaining 25% for testing. Note that each language setting of this dataset matches the quantity and partitioning of sentence pairs in the widely-used Microsoft Research Paraphrase Corpus benchmark for monolingual paraphrase identification (Yin et al., 2016; Das and Smith, 2009). Several examples from the dataset are shown in Table 2. The datasets and the processing scripts are available at https://github.com/muhaochen/bilingual_dictionaries.

³ https://www.wiktionary.org/
⁴ An inter-language link matches the entries of counterpart words between language versions of Wiktionary.

4.2 Cross-lingual Reverse Dictionary Retrieval

The objective of this task is to enable cross-lingual semantic retrieval of words based on descriptions. Besides comparing variants of BilDRL that adopt different sentence encoders and learning strategies, we also compare with the monolingual retrieval approach proposed by Hill et al. (2016). Instead of directly associating cross-lingual word definitions, this approach learns definition-to-word mappings in a monolingual scenario. When it is applied to the multilingual setting, given a lexical definition, it first retrieves the corresponding word in the source language. Then, it looks for semantically related words in the target language using bilingual word embeddings. As discussed in Section 3, NMT does not apply to this task because it cannot capture the multi-granular correspondence between a sentence and a word.

Evaluation Protocol. Before training the models, we randomly select 500 word definitions from each dictionary as test cases, and exclude these definitions from the training data. Each of the basic BilDRL variants is trained on one bilingual dictionary. The monolingual retrieval models are trained to fit the target words in the original languages of the word definitions, which are also provided in Wiktionary. BilDRL variants with multi-task or joint learning use both dictionaries of the same language pair. In the test phase, for each test case $(w^i, S_w^j) \in D_{ij}$, the prediction performs a kNN search from the definition encoding $E_{ij}(S_w^j)$, and records the rank of $w^i$ within the vocabulary of $l_i$. We limit the vocabularies to all words that appear in the Wikt3l dataset, which involves around 45k English words, 44k French words and 36k Spanish words. To prevent the surface information of the target word from appearing in the definition, we have also masked out any translation of the target word occurring in the definition using a wildcard token. The reported metrics are P@1 (%), the proportion of ranks no larger than 10 (P@10, %), and the mean reciprocal rank (MRR).

We pre-train BilBOWA based on the original configuration by Gouws et al. (2015) and obtain 50-dimensional initializations of the bilingual word embedding spaces for the English-French and English-Spanish settings respectively. For the CNN, GRU, and attentive GRU (ATT) encoders, we stack five of the corresponding encoding layers with hidden sizes of 200, and two affine layers are applied to the final output for dimension reduction. This encoder architecture consistently gave the best performance in our tuning. Through comprehensive hyperparameter tuning, we fix the learning rate $\alpha$ to 0.0005, the exponential decay rates of AMSGrad $\beta_1$ and $\beta_2$ to 0.9 and 0.999, the coefficients $\lambda_1$ and $\lambda_2$ both to 0.1, and the batch size to 64. Kernel size and pooling size are both set to 2 for CNN. Word definitions are zero-padded (short ones) or truncated (long ones) to the sequence length of 15, since most definitions (over 92%) are within 15 words in the dataset. Training is limited to 1,000 epochs for all models as well as the dictionary thread of asynchronous joint learning, within which all models are able to converge.

Results. Results are reported in Table 3 in four groups. The first group compares four different encoding techniques for the basic dictionary models. GRU consistently outperforms CNN and BOW, since the latter two fail to capture the important sequential information for descriptions. ATT, which weighs the hidden states, shows notable improvements over GRU. While we equip the two better encoding techniques with the monolingual retrieval approach (GRU-mono and ATT-

Languages        En-Fr                Fr-En                En-Es                Es-En
Metric      P@1   P@10  MRR     P@1   P@10  MRR     P@1   P@10  MRR     P@1   P@10  MRR
BOW          0.8   3.4  0.011    0.4   2.2  0.006    0.4   2.4  0.007    0.4   2.6  0.007
CNN          6.0  12.4  0.070    6.4  14.8  0.072    3.8   7.2  0.045    7.0  16.8  0.088
GRU         35.6  46.0  0.380   38.8  49.8  0.410   47.8  59.0  0.496   57.6  67.2  0.604
ATT         38.8  47.4  0.411   39.8  50.2  0.425   51.6  59.2  0.534   60.4  68.4  0.629
GRU-mono    21.8  33.2  0.242   27.8  37.0  0.297   34.4  41.2  0.358   36.8  47.2  0.392
ATT-mono    22.8  33.6  0.249   27.4  39.0  0.298   34.6  42.2  0.358   39.4  48.6  0.414
GRU-MTL     43.4  49.2  0.452   44.4  52.8  0.467   50.4  60.0  0.530   63.6  71.8  0.659
ATT-MTL     46.8  56.6  0.487   47.6  56.6  0.497   55.8  62.2  0.575   66.4  75.0  0.687
ATT-joint   63.6  69.4  0.654   68.2  75.4  0.706   69.0  72.8  0.704   78.6  83.4  0.803

Table 3: Cross-lingual reverse dictionary retrieval results by BilDRL variants. We report P@1, P@10, and MRR on four groups of models: (i) basic dictionary models that adopt four different encoding techniques (BOW, CNN, GRU and ATT); (ii) models with the two best encoding techniques that enforce the monolingual retrieval approach by Hill et al. (2016) (GRU-mono and ATT-mono); (iii) models adopting bilingual multi-task learning (GRU-MTL and ATT-MTL); (iv) joint learning that employs the best dictionary model of ATT-MTL (ATT-joint).
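The evaluation protocol of Section 4.2 ranks the target word by a kNN search from the definition encoding and then aggregates the ranks into P@1, P@10 and MRR. The sketch below shows one way to compute these quantities. It rests on assumptions that the text does not state: Euclidean distance is used for the kNN search (chosen here only because the training loss is an L2 loss), ties are broken in favor of the target word, and the helper names `rank_of_target`, `precision_at_k` and `mean_reciprocal_rank` are hypothetical.

```python
import numpy as np


def rank_of_target(encoding, vocab_matrix, target_index):
    """1-based rank of the target word among all vocabulary embeddings,
    ordered by Euclidean distance to the definition encoding (assumed metric)."""
    dists = np.linalg.norm(vocab_matrix - encoding, axis=1)
    # Count the words strictly closer than the target; ties favor the target.
    return int(np.sum(dists < dists[target_index])) + 1


def precision_at_k(ranks, k):
    """Percentage of test cases whose target word has rank no larger than k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test cases."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy usage: a 4-word vocabulary in 2 dimensions and one definition encoding.
vocab = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
encoding = np.array([0.9, 0.1])
ranks = [rank_of_target(encoding, vocab, i) for i in (1, 0, 2)]
```

With ranks collected over the 500 test definitions of a dictionary, `precision_at_k(ranks, 1)`, `precision_at_k(ranks, 10)` and `mean_reciprocal_rank(ranks)` correspond to the three columns reported per language direction in Table 3.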
Felix Hill, KyungHyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics, 4:17–30.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pages 2042–2050.

Jan H Hulstijn, Merel Hollander, and Tine Greidanus. 1996. Incidental vocabulary learning by advanced foreign language students: The influence of marginal glosses, dictionary use, and reoccurrence of unknown words. The Modern Language Journal, 80(3):327–339.

Thang Luong, Hieu Pham, and Christopher D Manning. 2015b. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.

Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, et al. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2786–2792.

Ani Nenkova, Kathleen McKeown, et al. 2012. A survey of text summarization techniques. In Mining Text Data, pages 43–76. Springer.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In ACL.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the convergence of Adam and beyond. In International Conference on Learning Representations.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In International Conference on Learning Representations.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research.

Lei Sha, Baobao Chang, et al. 2016. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In Proceedings of the International Conference on Computational Linguistics.

Yogarshi Vyas and Marine Carpuat. 2016. Sparse bilingual word representations for cross-lingual lexical entailment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1187–1197.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011.

Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Wenpeng Yin, Hinrich Schütze, et al. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics, 4(1).

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1412.
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Informa- tion Processing Systems, pages 6833–6844.
Robert Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Confer- ence on Artificial Intelligence.
Shyam Upadhyay, Nitish Gupta, and Dan Roth. 2018. Joint multilingual supervision for cross-lingual en- tity linking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro- cessing, pages 2486–2495.
Ivan Vulic,´ Anna Korhonen, et al. 2016. On the role of seed lexicons in learning bilingual word embed- dings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), volume 1, pages 247–257.