
Learning to Represent Bilingual Dictionaries

Muhao Chen1*, Yingtao Tian2*, Haochen Chen2, Kai-Wei Chang1, Steven Skiena2 & Carlo Zaniolo1
1 University of California, Los Angeles, CA, USA
2 The State University of New York, Stony Brook, NY, USA
[email protected]; {kwchang, zaniolo}@cs.ucla.edu; {yittian, haocchen, skiena}@cs.stonybrook.edu
* Both authors contributed equally to this work.

Abstract

Bilingual word embeddings have been widely used to capture the correspondence of lexical semantics in different human languages. However, the cross-lingual correspondence between sentences and words is less studied, despite that this correspondence can significantly benefit many applications such as cross-lingual semantic search and textual inference. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionaries1. The proposed model is trained to map the lexical definitions to the cross-lingual target words, for which we explore with different sentence encoding techniques. To enhance the learning process on limited resources, our model adopts several critical learning strategies, including multi-task learning on different bridges of languages, and joint learning of the dictionary model with a bilingual word embedding model. We conduct experiments on two new tasks. In the cross-lingual reverse dictionary retrieval task, we demonstrate that our model is capable of comprehending bilingual concepts based on descriptions, and the proposed learning strategies are effective. In the bilingual paraphrase identification task, we show that our model effectively associates sentences in different languages via a shared embedding space, and outperforms existing approaches in identifying bilingual paraphrases.

1 We refer the term dictionary to its common meaning, i.e. lexical definitions of words. Note that this is different from some papers on bilingual settings that refer dictionaries to seed lexicons for one-to-one word mappings.

1 Introduction

Cross-lingual semantic representation learning has attracted significant attention recently. Various approaches have been proposed to align words of different languages in a shared embedding space (Ruder et al., 2017). By offering task-invariant semantic transfers, these approaches critically support many cross-lingual NLP tasks including neural machine translation (NMT) (Devlin et al., 2014), bilingual document classification (Zhou et al., 2016), knowledge alignment (Chen et al., 2018b) and entity linking (Upadhyay et al., 2018).

While many existing approaches have been proposed to associate words between languages (Chandar et al., 2014; Gouws et al., 2015; Luong et al., 2015a), modeling the correspondence between lexical and sentential semantics across different languages is still an unresolved challenge. We argue that learning to represent such cross-lingual and multi-granular correspondence is well desired and natural for multiple reasons. One reason is that learning word-to-word correspondence has a natural limitation, considering that many words do not have direct translations in another language. For example, schadenfreude in German, which means a feeling of joy that comes from knowing the troubles of other people, has no proper English counterpart word. To appropriately learn the representations of such words in bilingual embeddings, we need to capture their meanings based on the definitions.

Besides, modeling such correspondence is also highly beneficial to many application scenarios. One example is cross-lingual semantic search of words or concepts (Hill et al., 2016), where the words or concepts are retrieved based on sentential descriptions (see Fig. 1). Others include discourse relation detection in bilingual dialogue utterances (Jiang et al., 2018), multilingual text summarization (Nenkova et al., 2012), and educational applications for foreign language learners. Finally, it is natural in foreign language learning that a human learns foreign words by looking up their meanings in the native language (Hulstijn et al., 1996). Therefore, learning such correspondence essentially mimics human learning behaviors.

Figure 1: An example illustrating the two cross-lingual tasks. The cross-lingual reverse dictionary retrieval finds cross-lingual target words based on descriptions. In terms of cross-lingual paraphrases, the French sentence (which means any male being considered in relation to his father and mother, or only one of them) describes the same meaning as the English sentence, but has much more content details.

However, realizing such a representation learning model is a non-trivial task, inasmuch as it requires a comprehensive learning process to effectively compose the semantics of arbitrary-length sentences in one language, and associate that with single words in another language. Consequently, this objective also demands high-quality cross-lingual alignment that bridges between single and sequences of words. Such alignment information is generally not available in the parallel corpora and seed lexicons that are utilized by bilingual word embeddings (Ruder et al., 2017).

To incorporate the representations of bilingual lexical and sentential semantics, we propose an approach to capture the mapping from the definitions to the corresponding foreign words by leveraging bilingual dictionaries. The proposed model BilDRL (Bilingual Dictionary Representation Learning) first constructs a word embedding space with pre-trained bilingual word embeddings. Based on cross-lingual word definitions, a sentence encoder is trained to realize the mapping from literal descriptions to target words in the bilingual word embedding space, for which we investigate with multiple encoding techniques. To enhance cross-lingual learning on limited resources, BilDRL conducts multi-task learning on different directions of a language pair. Moreover, BilDRL enforces a joint learning strategy of bilingual word embeddings and the sentence encoder, which seeks to gradually adjust the embedding space to better suit the representation of cross-lingual word definitions.

To show the applicability of BilDRL, we conduct experiments on two useful cross-lingual tasks (see Fig. 1). (i) Cross-lingual reverse dictionary retrieval seeks to retrieve words or concepts given descriptions in another language. This task is useful to help users find foreign words based on the notions or descriptions, and is especially beneficial to users such as translators, foreign language learners and technical writers using non-native languages. We show that BilDRL achieves promising results on this task, while bilingual multi-task learning and joint learning dramatically enhance the performance. (ii) Bilingual paraphrase identification asks whether two sentences in different languages essentially express the same meaning, which is critical to question answering or dialogue systems that apprehend multilingual utterances (Bannard and Callison-Burch, 2005). This task is challenging, as it requires a model to comprehend cross-lingual paraphrases that are inconsistent in grammar, content details and word orders. BilDRL maps sentences to the lexicon embedding space. This process reduces the problem to evaluating the similarity of lexicon embeddings, which can be easily solved by a simple classifier. BilDRL performs well with even a small amount of data, and significantly outperforms previous approaches.
2 Related Work

We discuss two lines of relevant work.

Bilingual word embeddings. Various approaches have been proposed for training bilingual word embeddings. These approaches span two families: off-line mappings and joint training.

The off-line mapping based approach fixes the structures of pre-trained monolingual embeddings, and induces bilingual projections based on seed lexicons (Mikolov et al., 2013a). Some variants of this approach improve the quality of projections by adding constraints such as orthogonality of transforms, normalization and mean centering of embeddings (Xing et al., 2015; Artetxe et al., 2016; Vulić et al., 2016). Others adopt canonical correlation analysis to map separate monolingual embeddings to a shared embedding space (Faruqui and Dyer, 2014; Doval et al., 2018).

Unlike off-line mappings, joint training models simultaneously update word embeddings and cross-lingual alignment. In doing so, such approaches generally capture more precise cross-lingual semantic transfer (Ruder et al., 2017; Upadhyay et al., 2018). While a few such models still maintain separated embedding spaces for each language (Artetxe et al., 2017), more of them maintain a unified space for both languages. The cross-lingual semantic transfer by these models is captured from parallel corpora with sentential or document-level alignment, using techniques such as bilingual bag-of-words distances (BilBOWA) (Gouws et al., 2015), Skip-Gram (Coulmance et al., 2015) and sparse tensor factorization (Vyas and Carpuat, 2016).

Neural sentence modeling. Neural sentence models seek to capture phrasal or sentential semantics from word sequences. They often adopt encoding techniques such as recurrent neural encoders (RNN) (Kiros et al., 2015), convolutional encoders (CNN) (Chen et al., 2018a), and attentive encoders (Rocktäschel et al., 2016) to represent the composed semantics of a sentence as an embedding vector. Recent works have focused on apprehending pairwise correspondence of sentential semantics by adopting multiple neural sentence models in one learning architecture, including Siamese models for detecting discourse relations of sentences (Sha et al., 2016), and sequence-to-sequence models for tasks like style transfer (Shen et al., 2017), text summarization (Chopra et al., 2016) and translation (Wu et al., 2016).

On the other hand, fewer efforts have been put to characterizing the associations between sentential and lexical semantics. Hill et al. (2016) and Ji et al. (2017) learn off-line mappings between monolingual descriptions and lexemes to capture such associations. Eisner et al. (2016) adopt a similar approach to capture emojis based on descriptions. To the best of our knowledge, there has been no previous approach to learn to discover the correspondence of sentential and lexical semantics in a multilingual scenario. This is exactly the focus of our work, in which the proposed strategies of multi-task learning and joint learning are critical to the corresponding learning process under limited resources. Utilizing such correspondence, our approach also sheds light on addressing discourse relation detection in a multilingual scenario.

3 Modeling Bilingual Dictionaries

We hereby begin our modeling with the formalization of bilingual dictionaries. We use L to denote the set of languages. For a language l ∈ L, V_l denotes its vocabulary, where for each word w ∈ V_l, bold-faced w ∈ R^k denotes its embedding vector. A l_i-l_j bilingual dictionary D(l_i, l_j) (or simply D_ij) contains dictionary entries (w^i, S^j_w) ∈ D_ij, in which w^i ∈ V_{l_i}, and S^j_w = w^j_1 ... w^j_n (w^j_· ∈ V_{l_j}) is a cross-lingual definition that describes the word w^i with a sequence of words in language l_j. For example, a French-English dictionary D(Fr, En) could include the French word appétit accompanied by its English definition desire for, or relish of, food or drink. Note that, for a word w^i, multiple definitions in l_j may coexist.
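To make the formalization concrete, the following is a minimal sketch of how such dictionary entries can be represented in code; the class and field names (and the sample entry) are illustrative and not part of the released dataset.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DictEntry:
    """One entry (w^i, S^j_w) of a bilingual dictionary D(l_i, l_j)."""
    word: str              # target word w^i in language l_i
    word_lang: str         # l_i, e.g. "fr"
    definition: List[str]  # token sequence w^j_1 ... w^j_n in language l_j
    def_lang: str          # l_j, e.g. "en"

# A D(Fr, En) entry: the French word "appétit" described in English.
entry = DictEntry(
    word="appétit",
    word_lang="fr",
    definition="desire for , or relish of , food or drink".split(),
    def_lang="en",
)

# A dictionary D_ij is simply a collection of such entries; several
# definitions for the same target word may coexist.
D_fr_en: List[DictEntry] = [entry]
```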
BilDRL is constructed and improved through three stages, as depicted in Fig. 2. A sentence encoder is first used to learn from a bilingual dictionary the association between words and definitions. Then, in a pre-trained bilingual word embedding space, multi-task learning is conducted on both directions of a language pair. Lastly, joint learning with word embeddings is enforced to simultaneously adjust the embedding space during the training of the dictionary model, which further enhances the cross-lingual learning process.

It is noteworthy that NMT (Wu et al., 2016) is considered as an ostensibly relevant method to ours. NMT does not apply to our problem setting because it has major differences from our work in the following perspectives: (i) In terms of data modalities, NMT has to bridge between corpora of the same granularity, i.e. either between sentences or between lexicons. This is unlike BilDRL, which captures multi-granular correspondence of semantics across different modalities, i.e. sentences and words; (ii) As for learning strategies, NMT relies on an encoder-decoder architecture using end-to-end training (Luong et al., 2015b), while BilDRL employs joint learning of a dictionary-based sentence encoder and a bilingual embedding space.

3.1 Encoders for Lexical Definitions

BilDRL models a dictionary using a neural sentence encoder E(S), which composes the meaning of the sentence into a latent vector representation. We hereby introduce this model component, which is designed to be a GRU encoder with self-attention. Besides that, we also investigate other widely-used neural sequence encoders.


Figure 2: Joint learning architecture of BilDRL, combining the multi-task dictionary model, the bilingual BilBOWA alignment model and the English/French monolingual Skip-Gram models under the joint objective function. In the figure, the same color indicates shared parameters.

3.1.1 Attentive GRU Encoder

The GRU encoder is a computationally efficient alternative of the LSTM (Cho et al., 2014). Each unit consists of a reset gate r_t and an update gate z_t to track the state of the sequence. Given the vector representation w_t of an incoming item w_t, GRU updates the hidden state h_t^{(1)} as a linear combination of the previous state h_{t-1}^{(1)} and the candidate state \tilde{h}_t^{(1)} of the new item w_t as below.

    h_t^{(1)} = z_t \odot \tilde{h}_t^{(1)} + (1 - z_t) \odot h_{t-1}^{(1)}

The update gate z_t balances between the information of the previous sequence and the new item, where M_z and N_z are two weight matrices, b_z is a bias vector, and \sigma is the sigmoid function.

    z_t = \sigma(M_z w_t + N_z h_{t-1}^{(1)} + b_z)

The candidate state \tilde{h}_t^{(1)} is calculated similarly to those in a traditional recurrent unit as below. The reset gate r_t thereof controls how much information of the past sequence should contribute to the candidate state.

    \tilde{h}_t^{(1)} = \tanh(M_s w_t + r_t \odot (N_s h_{t-1}^{(1)}) + b_s)
    r_t = \sigma(M_r w_t + N_r h_{t-1}^{(1)} + b_r)

While a GRU encoder can stack multiple of the above GRU layers, without an attention mechanism, the last state h_S^{(1)} of the last layer represents the overall meaning of the encoded sentence S. The self-attention mechanism (Conneau et al., 2017) seeks to highlight the important units in an input sentence when capturing its overall meaning, which is calculated as below:

    u_t = \tanh(M_a h_t^{(1)} + b_a)
    a_t = \frac{\exp(u_t^\top u_S)}{\sum_{w_m \in S} \exp(u_m^\top u_S)}
    h_t^{(2)} = |S| \, a_t u_t

u_t is the intermediary representation of GRU output h_t^{(1)}, and u_S = \tanh(M_a h_S^{(1)} + b_a) is that of the last GRU output h_S^{(1)}. u_S can be seen as a high-level representation of the input sequence. By measuring the similarity of each u_t with u_S, the normalized attention weight a_t, which highlights an input that contributes significantly to the overall meaning, is produced through a softmax. Note that a scalar |S| is multiplied along with a_t to u_t, so as to keep the weighted representation h_t^{(2)} from losing the scale of h_t^{(1)}. The sentence encoding is calculated as the average of the last attention layer, E^{(1)}(S) = \frac{1}{|S|} \sum_{t=1}^{|S|} a_t h_t^{(2)}.

3.1.2 Other Encoders

We also experiment with other widely used neural sentence modeling techniques2, which are however outperformed by the attentive GRU in our tasks. These techniques include the vanilla GRU, CNN (Kalchbrenner et al., 2014), and linear bag-of-words (BOW) (Hill et al., 2016). We briefly introduce the latter two techniques in the following.

A convolutional encoder applies a kernel M_c \in R^{h \times k} to produce the latent representation h_t^{(3)} = \tanh(M_c w_{t:t+h-1} + b_c) from each h-gram w_{t:t+h-1} of the input vector sequence, for which h is the kernel size and b_c is a bias vector. A sequence of latent vectors H^{(3)} = [h_1^{(3)}, h_2^{(3)}, ..., h_{|S|-h+1}^{(3)}] is produced from the input, where each latent vector leverages the significant local semantic features from each h-gram. Following convention (Liu et al., 2017), we apply dynamic max-pooling to extract robust features from the convolution outputs, and use the mean-pooling results of the last layer to represent the sentential semantics.

The linear bag-of-words (BOW) encoder (Ji et al., 2017; Hill et al., 2016) is realized by the sum of projected word embeddings of the input sentence, i.e. E^{(2)}(S) = \sum_{t=1}^{|S|} M_b w_t.

2 Note that recent advances in monolingual contextualized embeddings like multilingual ELMo (Peters et al., 2018; Che et al., 2018) and M-BERT (Pires et al., 2019; Devlin et al., 2018) can also be supported to represent sentences for our setting. We leave them as future work, as they require non-trivial adaption to both multilingual settings and joint training, and extensive pre-training on external corpora.
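As a concrete reference for Section 3.1.1, below is a minimal PyTorch-style sketch of a single-layer GRU encoder with this form of self-attention. The layer sizes, class name, and the use of the final position as u_S (assuming no padding) are our own simplifications; the stacked layers and affine projections used in the experiments are omitted, and the returned encoding is the average of h_t^{(2)}.

```python
import torch
import torch.nn as nn

class AttentiveGRUEncoder(nn.Module):
    """Sketch of the self-attentive GRU encoder of Section 3.1.1."""
    def __init__(self, emb_dim: int, hidden_dim: int):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, hidden_dim)        # M_a, b_a

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # w: (batch, |S|, emb_dim) pre-trained word vectors of a definition
        h, _ = self.gru(w)                                   # h_t^(1)
        u = torch.tanh(self.attn(h))                         # u_t
        u_S = u[:, -1, :]                                    # u_S from the last output
        scores = torch.einsum("bth,bh->bt", u, u_S)          # u_t^T u_S
        a = torch.softmax(scores, dim=1)                     # attention weights a_t
        # h_t^(2) = |S| a_t u_t; averaging over t yields the attention-weighted
        # sum of the u_t, which we return as the sentence encoding.
        h2 = w.size(1) * a.unsqueeze(-1) * u
        return h2.mean(dim=1)
```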
3.2 Basic Learning Objective

The objective of learning the dictionary model is to map the encodings of cross-lingual word definitions to the target word embeddings. This is realized by minimizing the following L2 loss,

    L^{ST}_{ij} = \frac{1}{|D_{ij}|} \sum_{(w^i, S^j_w) \in D_{ij}} \| E_{ij}(S^j_w) - w^i \|_2^2

in which E_{ij} is the dictionary model that maps from descriptions in l_j to words in l_i.

The above defines the basic model variants of BilDRL that learn on a single dictionary. For word representations in the learning process, BilDRL initializes the embedding space using pre-trained word embeddings. Note that, without adopting the joint learning strategy in Section 3.4, the learning process does not update word embeddings that are used to represent the definitions and target words. While other forms of loss such as cosine proximity (Hill et al., 2016) and hinge loss (Ji et al., 2017) may also be used in the learning process, we find that the L2 loss consistently leads to better performance in our experiments.

3.3 Bilingual Multi-task Learning

In cases where entries in a bilingual dictionary are not amply provided, learning the above bilingual dictionary model on one ordered language pair may fall short due to insufficiency of alignment information. One practical solution is to conduct a bilingual multi-task learning process. In detail, given a language pair (l_i, l_j), we learn the dictionary model E_{ij} on both dictionaries D_{ij} and D_{ji} with shared parameters. Correspondingly, we rewrite the previous learning objective function as below, in which D = D_{ij} \cup D_{ji}.

    L^{MT}_{ij} = \frac{1}{|D|} \sum_{(w, S_w) \in D} \| E_{ij}(S_w) - w \|_2^2

This strategy non-trivially requests the same dictionary model to represent semantic transfer in two directions of the language pair. To fulfill such a request, we initialize the embedding space using the BilBOWA embeddings (Gouws et al., 2015), which provide a unified embedding space that resolves both monolingual and cross-lingual semantic relatedness of words. In practice, we find this simple multi-task strategy to bring significant improvement to our cross-lingual tasks. Note that, besides BilBOWA, other joint-training bilingual embeddings in a unified space (Doval et al., 2018) can also support this strategy, for which we leave the comparison to future work.

3.4 Joint Learning Objective

While the above learning strategies are based on a fixed embedding space, we lastly propose a joint learning strategy. During the training process, this strategy simultaneously updates the embedding space based on both the dictionary model and the bilingual word embedding model. The learning is through asynchronous minimization of the following joint objective function,

    J = L^{MT}_{ij} + \lambda_1 (L^{SG}_i + L^{SG}_j) + \lambda_2 \Omega^A_{ij}

where \lambda_1 and \lambda_2 are two positive hyperparameters. L^{SG}_i and L^{SG}_j are the original Skip-Gram losses (Mikolov et al., 2013b) to separately obtain word embeddings on monolingual corpora of l_i and l_j. \Omega^A_{ij}, termed as below, is the alignment loss to minimize bag-of-words distances for aligned sentence pairs (S^i, S^j) in parallel corpora C_{ij}.

    \Omega^A_{ij} = \frac{1}{|C_{ij}|} \sum_{(S^i, S^j) \in C_{ij}} d_S(S^i, S^j)

    d_S(S^i, S^j) = \left\| \frac{1}{|S^i|} \sum_{w^i_m \in S^i} w^i_m - \frac{1}{|S^j|} \sum_{w^j_n \in S^j} w^j_n \right\|_2^2

The joint learning process adapts the embedding space to better suit the dictionary model, which is shown to further enhance the cross-lingual learning of BilDRL.
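The sketch below illustrates how these objectives fit together; the skipgram_loss and alignment_loss helpers are hypothetical placeholders for the Skip-Gram and BilBOWA bag-of-words terms, and the batch format is our own assumption.

```python
import torch

def dictionary_loss(encoder, word_emb, def_ids, target_ids):
    """L^MT of Section 3.3 (also L^ST when the batch comes from one dictionary):
    mean squared L2 distance between encoded definitions and target embeddings."""
    defs = word_emb(def_ids)            # (B, |S|, k) definition token embeddings
    targets = word_emb(target_ids)      # (B, k) embeddings of the target words
    return ((encoder(defs) - targets) ** 2).sum(dim=-1).mean()

def joint_objective(encoder, word_emb, dict_batch, mono_batches, para_batch,
                    skipgram_loss, alignment_loss, lam1=0.1, lam2=0.1):
    """J = L^MT_ij + lam1 (L^SG_i + L^SG_j) + lam2 Omega^A_ij of Section 3.4."""
    l_mt = dictionary_loss(encoder, word_emb, *dict_batch)
    l_sg = skipgram_loss(mono_batches[0]) + skipgram_loss(mono_batches[1])
    omega = alignment_loss(para_batch)   # bag-of-words distance on parallel pairs
    return l_mt + lam1 * l_sg + lam2 * omega
```

The default values of lam1 and lam2 mirror the setting of 0.1 used for both coefficients in the experiments (Section 4.2).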
3.5 Training

To initialize the embedding space, we pre-trained BilBOWA on the parallel corpora of Europarl v7 (Koehn, 2005) and monolingual corpora of the tokenized Wikipedia dump (Al-Rfou et al., 2013). For models without joint learning, we use AMSGrad (Reddi et al., 2018) to optimize the parameters. Each model without bilingual multi-task learning thereof is trained on batched samples from each individual dictionary. Multi-task learning models are trained on batched samples from two dictionaries. Within each batch, entries of different directions of languages can be mixed together. For joint learning, we conduct an efficient multi-threaded asynchronous training (Mnih et al., 2016) of AMSGrad. In detail, after initializing the embedding space based on pre-trained BilBOWA, parameter updating based on the four components of J occurs across four worker threads. Two monolingual threads select batches of monolingual contexts from the Wikipedia dump of the two languages for Skip-Gram, one alignment thread randomly samples parallel sentences from Europarl, and one dictionary thread extracts samples of entries for the bilingual multi-task dictionary model. Each thread makes a batched update to model parameters asynchronously for each term of J. The asynchronous training of all threads keeps going until the dictionary thread finishes its epochs.
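A schematic of this four-thread asynchronous procedure is sketched below. The update callables are hypothetical stand-ins for AMSGrad steps on the individual terms of J, and lock-free updates on shared parameters are assumed, as the paper does not detail the synchronization.

```python
import threading

def run_async_training(update_fns, batch_iters, dict_steps):
    """Asynchronous joint training of Section 3.5 (sketch).

    update_fns  : dict with keys "sg_i", "sg_j", "align", "dict"; each value
                  consumes one batch and applies an AMSGrad update to the
                  shared parameters for its term of J (all hypothetical).
    batch_iters : dict with the same keys, yielding batches (monolingual
                  contexts, Europarl sentence pairs, or dictionary entries).
    dict_steps  : number of batched updates for the dictionary thread; when it
                  finishes, the whole training stops.
    """
    stop = threading.Event()

    def worker(name):
        for i, batch in enumerate(batch_iters[name]):
            if stop.is_set():
                break
            update_fns[name](batch)                   # one asynchronous update
            if name == "dict" and i + 1 >= dict_steps:
                stop.set()                            # dictionary thread ends training

    threads = [threading.Thread(target=worker, args=(k,)) for k in update_fns]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```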

4 Experiments

We present experiments on two multilingual tasks: the cross-lingual reverse dictionary retrieval task and the bilingual paraphrase identification task.

4.1 Datasets

The experiment of cross-lingual reverse dictionary retrieval is conducted on a trilingual dataset Wikt3l. This dataset is extracted from Wiktionary3, which is one of the largest freely available multilingual dictionary resources on the Web. Wikt3l contains dictionary entries of the language pairs (English, French) and (English, Spanish), which form En-Fr, Fr-En, En-Es and Es-En dictionaries on four bridges of languages in total. Two types of cross-lingual definitions are extracted from Wiktionary: (i) cross-lingual definitions provided under the Translations sections of Wiktionary pages; (ii) monolingual definitions for words that are linked to a cross-lingual counterpart with an inter-language link4 of Wiktionary. We exclude all the definitions of stop words in constructing the dataset, and list the statistics in Table 1.

3 https://www.wiktionary.org/
4 An inter-language link matches the entries of counterpart words between language versions of Wiktionary.

Dictionary       En-Fr     Fr-En     En-Es     Es-En
#Target words    15,666    16,857     8,004    16,986
#Definitions     50,412    58,808    20,930    56,610

Table 1: Statistics of the bilingual dictionary dataset Wikt3l.

Since existing datasets for paraphrase identification are merely monolingual, we contribute another dataset WBP3l for cross-lingual sentential paraphrase identification. This dataset contains 6,000 bilingual sentence pairs respectively for the En-Fr and En-Es settings. Within each bilingual setting, positive cases are formed as pairs of descriptions aligned by inter-language links, which exclude the word descriptions in Wikt3l used for training BilDRL. To generate negative examples, given a source word, we first find its 15 nearest neighbors in the embedding space. Within the nearest neighbors, we use ConceptNet (Speer et al., 2017) to filter out the synonyms of the source word, so as to prevent generating false negative cases. Then we randomly pick one word from the filtered neighbors and pair its cross-lingual definition with the English definition of the source word to create a negative case. This process ensures that each negative case is endowed with limited dissimilarity of sentence meanings, which makes the decision more challenging. For each language setting, we randomly select 70% for training, 5% for validation, and the rest 25% for testing. Note that each language setting of this dataset thereof matches with the quantity and partitioning of sentence pairs in the widely-used Microsoft Research Paraphrase Corpus benchmark for monolingual paraphrase identification (Yin et al., 2016; Das and Smith, 2009). Several examples from the dataset are shown in Table 2. The datasets and the processing scripts are available at https://github.com/muhaochen/bilingual_dictionaries.

Positive Examples
En: Being remote in space.
Fr: Se trouvant à une grande distance.
En: The interdisciplinary science that applies theories and methods of the physical sciences to questions of biology.
Es: Ciencia que emplea y desarrolla las teorías y métodos de la física en la investigación de los sistemas biológicos.
Negative Examples
En: A person who secedes or supports secession from a political union.
Fr: Contrôle politique exercé par une grande puissance sur une contrée inféodée.
En: The fear of closed, tight places.
Es: Pérdida o disminución considerables de la memoria.

Table 2: Examples of bilingual paraphrases from WBP3l.
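The negative-example construction for WBP3l described above can be sketched as follows; the three lookup callables are placeholders for the actual resources (the bilingual embedding index, ConceptNet, and the Wiktionary definition store).

```python
import random

def make_negative_pair(src_word, en_definition, neighbors_fn, synonyms_fn,
                       cross_def_fn, k=15):
    """Create one negative cross-lingual pair for WBP3l (sketch).

    neighbors_fn(w, k) -> k nearest words of w in the bilingual embedding space
    synonyms_fn(w)     -> synonyms of w according to ConceptNet
    cross_def_fn(w)    -> a cross-lingual (Fr/Es) definition of word w
    """
    # Filter synonyms out of the nearest neighbors to avoid false negatives.
    candidates = [w for w in neighbors_fn(src_word, k)
                  if w not in synonyms_fn(src_word)]
    if not candidates:
        return None
    distractor = random.choice(candidates)
    # Pair the English definition of the source word with the cross-lingual
    # definition of a near-but-not-synonymous word; label 0 = not a paraphrase.
    return en_definition, cross_def_fn(distractor), 0
```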
4.2 Cross-lingual Reverse Dictionary Retrieval

The objective of this task is to enable cross-lingual semantic retrieval of words based on descriptions. Besides comparing variants of BilDRL that adopt different sentence encoders and learning strategies, we also compare with the monolingual retrieval approach proposed by Hill et al. (2016). Instead of directly associating cross-lingual word definitions, this approach learns definition-to-word mappings in a monolingual scenario. When it applies to the multilingual setting, given a lexical definition, it first retrieves the corresponding word in the source language. Then, it looks for semantically related words in the target language using bilingual word embeddings. As discussed in Section 3, NMT does not apply to this task, due to the fact that it cannot capture the multi-granular correspondence between a sentence and a word.

Languages    En-Fr                  Fr-En                  En-Es                  Es-En
Metric       P@1   P@10   MRR       P@1   P@10   MRR       P@1   P@10   MRR       P@1   P@10   MRR
BOW          0.8   3.4    0.011     0.4   2.2    0.006     0.4   2.4    0.007     0.4   2.6    0.007
CNN          6.0   12.4   0.070     6.4   14.8   0.072     3.8   7.2    0.045     7.0   16.8   0.088
GRU          35.6  46.0   0.380     38.8  49.8   0.410     47.8  59.0   0.496     57.6  67.2   0.604
ATT          38.8  47.4   0.411     39.8  50.2   0.425     51.6  59.2   0.534     60.4  68.4   0.629
GRU-mono     21.8  33.2   0.242     27.8  37.0   0.297     34.4  41.2   0.358     36.8  47.2   0.392
ATT-mono     22.8  33.6   0.249     27.4  39.0   0.298     34.6  42.2   0.358     39.4  48.6   0.414
GRU-MTL      43.4  49.2   0.452     44.4  52.8   0.467     50.4  60.0   0.530     63.6  71.8   0.659
ATT-MTL      46.8  56.6   0.487     47.6  56.6   0.497     55.8  62.2   0.575     66.4  75.0   0.687
ATT-joint    63.6  69.4   0.654     68.2  75.4   0.706     69.0  72.8   0.704     78.6  83.4   0.803

Table 3: Cross-lingual reverse dictionary retrieval results by BilDRL variants. We report P@1, P@10, and MRR on four groups of models: (i) basic dictionary models that adopt four different encoding techniques (BOW, CNN, GRU and ATT); (ii) models with the two best encoding techniques that enforce the monolingual retrieval approach by Hill et al. (2016) (GRU-mono and ATT-mono); (iii) models adopting bilingual multi-task learning (GRU-MTL and ATT-MTL); (iv) joint learning that employs the best dictionary model of ATT-MTL (ATT-joint).

Evaluation Protocol. Before training the models, we randomly select 500 word definitions from each dictionary respectively as test cases, and exclude these definitions from the training data. Each of the basic BilDRL variants is trained on one bilingual dictionary. The monolingual retrieval models are trained to fit the target words in the original languages of the word definitions, which are also provided in Wiktionary. BilDRL variants with multi-task or joint learning use both dictionaries of the same language pair. In the test phase, for each test case (w^i, S^j_w) ∈ D_ij, the prediction performs a kNN search from the definition encoding E_ij(S^j_w), and records the rank of w^i within the vocabulary of l_i. We limit the vocabularies to all words that appear in the Wikt3l dataset, which involve around 45k English words, 44k French words and 36k Spanish words. To prevent the surface information of the target word from appearing in the definition, we have also masked out any translation of the target word occurring in the definition using a wildcard token. We aggregate three metrics on test cases: the accuracy P@1 (%), the proportion of ranks no larger than 10, P@10 (%), and the mean reciprocal rank MRR.

We pre-train BilBOWA based on the original configuration by Gouws et al. (2015) and obtain 50-dimensional initializations of the bilingual word embedding spaces respectively for the English-French and English-Spanish settings. For the CNN, GRU, and attentive GRU (ATT) encoders, we stack five of each corresponding encoding layers with hidden sizes of 200, and two affine layers are applied to the final output for dimension reduction. This encoder architecture consistently represents the best performance through our tuning. Through comprehensive hyperparameter tuning, we fix the learning rate α to 0.0005, the exponential decay rates of AMSGrad β1 and β2 to 0.9 and 0.999, coefficients λ1 and λ2 both to 0.1, and the batch size to 64. Kernel size and pooling size are both set to 2 for the CNN. Word definitions are zero-padded (short ones) or truncated (long ones) to the sequence length of 15, since most definitions (over 92%) are within 15 words in the dataset. Training is limited to 1,000 epochs for all models as well as the dictionary thread of asynchronous joint learning, in which all models are able to converge.
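The retrieval metrics of this protocol can be computed as in the sketch below; ranking candidates by Euclidean distance is our assumption (consistent with the L2 training objective), since the distance used for the kNN search is not stated explicitly.

```python
import numpy as np

def evaluate_retrieval(def_encodings, gold_ids, vocab_matrix):
    """P@1 (%), P@10 (%) and MRR for cross-lingual reverse dictionary retrieval.

    def_encodings : (N, k) array of encoded test definitions E_ij(S^j_w)
    gold_ids      : (N,) row indices of the gold target words in vocab_matrix
    vocab_matrix  : (V, k) embeddings of all candidate words of language l_i
    """
    p1 = p10 = mrr = 0.0
    for enc, gold in zip(def_encodings, gold_ids):
        dist = np.linalg.norm(vocab_matrix - enc, axis=1)      # kNN by L2 distance
        rank = int(np.argsort(dist).tolist().index(gold)) + 1  # 1-based rank of gold
        p1 += (rank == 1)
        p10 += (rank <= 10)
        mrr += 1.0 / rank
    n = len(gold_ids)
    return 100.0 * p1 / n, 100.0 * p10 / n, mrr / n
```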
Results. Results are reported in Table 3 in four groups. The first group compares four different encoding techniques for the basic dictionary models. GRU thereof consistently outperforms CNN and BOW, since the latter two fail to capture the important sequential information for descriptions. ATT, which weighs among the hidden states, has notable improvements over GRU. When we equip the two better encoding techniques with the monolingual retrieval approach (GRU-mono and ATT-mono), we find that the way of learning the dictionary models towards monolingual targets and retrieving cross-lingual related words incurs more impreciseness to the task. For models of the third group that conduct multi-task learning in two directions of a language pair, the results show significant enhancement of performance in both directions. For the final group of results, we incorporate the best variant of multi-task models into the joint learning architecture, which leads to compelling improvement of the task on all settings. This demonstrates that properly adapting the word embeddings jointly with the bilingual dictionary model efficaciously constructs an embedding space that better suits the representation of both bilingual lexical and sentential semantics.

In general, this experiment has identified the proper encoding techniques of the dictionary model. The proposed strategies of multi-task and joint learning effectively contribute to the precise characterization of the cross-lingual correspondence of lexical and sentential semantics, which has led to very promising capability of cross-lingual reverse dictionary retrieval.

4.3 Bilingual Paraphrase Identification

The bilingual paraphrase identification problem5 is a binary classification task with the goal to decide whether two sentences in different languages express the same meanings. BilDRL provides an effective solution by transferring sentential meanings to word-level representations and learning a simple classifier.

5 Paraphrases have similar meanings, but can largely differ in content details and word orders. Hence, they are essentially different from translations. We have found that even the well-recognized Google NMT frequently caused distortions to short sentence meanings, and led to results that were close to random guess by the baseline classifiers after translation.

We evaluate three variants of BilDRL on this task using WBP3l: the multi-task BilDRL with GRU encoders (BilDRL-GRU-MTL), the multi-task BilDRL with attentive GRU encoders (BilDRL-ATT-MTL), and the joint learning BilDRL with attentive GRU encoders (BilDRL-ATT-joint). We compare against several baselines of neural sentence pair models that are proposed for monolingual paraphrase identification. These models include Siamese structures of CNN (BiCNN) (Yin and Schütze, 2015), RNN (BiLSTM) (Mueller and Thyagarajan, 2016), attentive CNN (ABCNN) (Yin et al., 2016), attentive GRU (BiATT) (Rocktäschel et al., 2016), and BOW (BiBOW). To support the reasoning of cross-lingual semantics, we provide the baselines with the same BilBOWA embeddings.

Evaluation protocol. BilDRL transfers each sentence into a vector in the word embedding space. Then, for each sentence pair in the train set, a multi-layer perceptron (MLP) with a binary softmax loss is trained on the subtraction of the two vectors as a downstream classifier. Baseline models are trained end-to-end, each of which directly uses a parallel pair of encoders with shared parameters and an MLP that is stacked to the subtraction of two sentence vectors. Note that some works use concatenation (Yin and Schütze, 2015) or Manhattan distances (Mueller and Thyagarajan, 2016) of sentence vectors instead of their subtraction (Jiang et al., 2018), which we find to be less effective on a small amount of data.

We apply the configurations of the sentence encoders from the last experiment to the corresponding baselines, so as to show the performance under controlled variables. Training of a classifier is terminated by early-stopping based on the validation set. Following convention (Hu et al., 2014; Yin et al., 2016), we report the accuracy and F1 scores.
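A minimal sketch of the downstream classifier used in this protocol is shown below: two sentence vectors from the BilDRL encoder are subtracted and fed to an MLP trained with a binary softmax loss. The hidden size and depth are our own choices, as the paper does not specify them.

```python
import torch
import torch.nn as nn

class ParaphraseClassifier(nn.Module):
    """Binary classifier over the difference of two sentence encodings."""
    def __init__(self, dim: int = 50, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),        # two classes: paraphrase / not paraphrase
        )

    def forward(self, v_en: torch.Tensor, v_xx: torch.Tensor) -> torch.Tensor:
        # v_en, v_xx: sentence vectors produced by the BilDRL sentence encoder
        return self.mlp(v_en - v_xx)     # logits for a binary softmax (cross-entropy) loss

# Training minimizes nn.CrossEntropyLoss() on these logits, with early
# stopping on the validation split as described above.
```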
Results. This task is challenging due to the heterogeneity of cross-lingual paraphrases and the limitedness of learning resources. The results in Table 4 show that all the baselines, among which BiATT consistently outperforms the others, merely reach slightly over 60% accuracy on both the En-Fr and En-Es settings. We believe that this comes down to the fact that sentences of different languages are often drastically heterogeneous in both lexical semantics and the sentence grammar that governs the composition of words. Hence, it is not surprising that previous neural sentence pair models, which capture the semantic relation of bilingual sentences directly from all participating words, fall short at the multilingual task. BilDRL, however, effectively leverages the correspondence of lexical and sentential semantics to simplify the task to an easier entailment task in the lexicon space, for which the multi-task learning BilDRL-ATT-MTL outperforms the best baseline respectively by 3.80% and 4.80% of accuracy in the two language settings, while BilDRL-ATT-joint, employing the joint learning, further improves the task by another satisfying 3.26% and 1.06% of accuracy. Both also show notable increments in F1.

Languages           En&Fr            En&Es
Metrics             Acc.     F1      Acc.     F1
BiBOW               54.93    0.622   56.27    0.623
BiCNN               54.33    0.625   53.80    0.611
ABCNN               56.73    0.644   58.83    0.655
BiLSTM              59.60    0.662   57.60    0.637
BiATT               61.47    0.699   61.27    0.689
BilDRL-GRU-MTL      64.80    0.732   63.33    0.722
BilDRL-ATT-MTL      65.27    0.735   66.07    0.735
BilDRL-ATT-joint    68.53    0.785   67.13    0.759

Table 4: Accuracy and F1-scores of bilingual paraphrase identification. For BilDRL, the results by three model variants are reported: BilDRL-GRU-MTL and BilDRL-ATT-MTL are models with bilingual multi-task learning, and BilDRL-ATT-joint is the best ATT-based dictionary model variant deployed with both multi-task and joint learning.

5 Conclusion and Future Work

In this paper, we propose a neural embedding model BilDRL that captures the correspondence of cross-lingual lexical and sentential semantics. We experiment with multiple forms of neural sentence models and identify the best technique. The two learning strategies, bilingual multi-task learning and joint learning, are effective at enhancing the cross-lingual learning with limited resources, and also achieve promising performance on the cross-lingual reverse dictionary retrieval and bilingual paraphrase identification tasks by associating lexical and sentential semantics. An important direction of future work is to explore whether the word-sentence alignment can improve bilingual word embeddings. Applying BilDRL to bilingual question answering and semantic search systems is another important direction.

6 Acknowledgement

We thank the anonymous reviewers for their insightful comments. This work was supported in part by National Science Foundation Grant IIS-1760523.

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In The SIGNLL Conference on Computational Natural Language Learning.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 597–604.

Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems, pages 1853–1861.
Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In CoNLL Shared Task, pages 55–64.

Muhao Chen, Chang Ping Meng, Gang Huang, and Carlo Zaniolo. 2018a. Neural article pair modeling for Wikipedia sub-article matching. In Proceedings of the European Conference on Machine Learning.

Muhao Chen, Yingtao Tian, Kai-Wei Chang, Steven Skiena, and Carlo Zaniolo. 2018b. Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680.

Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1109–1113.
Dipanjan Das and Noah A Smith. 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, pages 468–476.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380.

Yerai Doval, Jose Camacho-Collados, Luis Espinosa Anke, and Steven Schockaert. 2018. Improving cross-lingual word embeddings by meeting in the middle. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 294–304.

Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. 2016. emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proceedings of the 32nd International Conference on Machine Learning, pages 748–756.

Felix Hill, KyungHyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics, 4:17–30.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pages 2042–2050.

Jan H Hulstijn, Merel Hollander, and Tine Greidanus. 1996. Incidental vocabulary learning by advanced foreign language students: The influence of marginal glosses, dictionary use, and reoccurrence of unknown words. The Modern Language Journal, 80(3):327–339.

Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3060–3066.

Jyun-Yu Jiang, Francine Chen, Yan-Ying Chen, and Wei Wang. 2018. Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1812–1822.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 212–217.

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124.

Thang Luong, Hieu Pham, and Christopher D Manning. 2015a. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159.

Thang Luong, Hieu Pham, and Christopher D Manning. 2015b. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.

Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, et al. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2786–2792.

Ani Nenkova, Kathleen McKeown, et al. 2012. A survey of text summarization techniques. In Mining Text Data, pages 43–76. Springer.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In ACL.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the convergence of Adam and beyond. In International Conference on Learning Representations.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In International Conference on Learning Representations.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research.

Lei Sha, Baobao Chang, et al. 2016. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In Proceedings of the International Conference on Computational Linguistics.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6833–6844.

Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.

Shyam Upadhyay, Nitish Gupta, and Dan Roth. 2018. Joint multilingual supervision for cross-lingual entity linking. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2486–2495.

Ivan Vulić, Anna Korhonen, et al. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 247–257.

Yogarshi Vyas and Marine Carpuat. 2016. Sparse bilingual word representations for cross-lingual lexical entailment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1187–1197.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011.

Wenpeng Yin and Hinrich Schütze. 2015. Convolutional neural network for paraphrase identification. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Wenpeng Yin, Hinrich Schütze, et al. 2016. ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics, 4(1).

Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1412.