
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

BattRAE: Bidimensional Attention-Based Recursive Autoencoders for Learning Bilingual Phrase Embeddings

Biao Zhang,1 Deyi Xiong,2 Jinsong Su1*
1 Xiamen University, Xiamen, China 361005
2 Soochow University, Suzhou, China 215006
[email protected], [email protected], [email protected]
* Corresponding author.

Abstract

In this paper, we propose a bidimensional attention based recursive autoencoder (BattRAE) to integrate clues and source-target interactions at multiple levels of granularity into bilingual phrase representations. We employ recursive autoencoders to generate tree structures of phrases with embeddings at different levels of granularity (e.g., words, sub-phrases and phrases). Over these embeddings on the source and target side, we introduce a bidimensional attention network to learn their interactions encoded in a bidimensional attention matrix, from which we extract two soft attention weight distributions simultaneously. These weight distributions enable BattRAE to generate compositive phrase representations via convolution. Based on the learned phrase representations, we further use a bilinear neural model, trained via a max-margin method, to measure bilingual semantic similarity. To evaluate the effectiveness of BattRAE, we incorporate this semantic similarity as an additional feature into a state-of-the-art SMT system. Extensive experiments on NIST Chinese-English test sets show that our model achieves a substantial improvement of up to 1.63 BLEU points on average over the baseline.

Src (Ref)                                           Tgt
shijie geda chengshi (major cities in the world)    in other major cities
dui jingji xuezhe (to the economists)               to economists

Table 1: Examples of bilingual phrases ((romanized) Chinese-English) from our translation model. The important words or phrases are highlighted in bold. Src = source, Tgt = target, Ref = literal translation of source phrase.

1 Introduction

As one of the most important components in statistical machine translation (SMT), the translation model measures the translation faithfulness of a hypothesis to a source fragment (Och and Ney 2003; Koehn, Och, and Marcu 2003; Chiang 2007). Conventional translation models extract a huge number of bilingual phrases with conditional translation probabilities and lexical weights (Koehn, Och, and Marcu 2003). Due to the heavy reliance of the calculation of these probabilities and weights on surface forms of bilingual phrases, traditional translation models often suffer from the problem of data sparsity. This leads researchers to investigate methods that learn underlying semantic representations of phrases using neural networks (Gao et al. 2014; Zhang et al. 2014; Cho et al. 2014; Su et al. 2015).

Typically, these neural models learn bilingual phrase embeddings in such a way that embeddings of source and corresponding target phrases are optimized to be as close as possible in a continuous space. In spite of their success, they either explore clues (linguistic items from contexts) only at a single level of granularity or capture interactions (alignments between source and target items) only at the same level of granularity to learn bilingual phrase embeddings.

We believe that clues and interactions from a single level of granularity are not adequate to measure the underlying semantic similarity of bilingual phrases due to the high language divergence. Take the Chinese-English translation pairs in Table 1 as examples. At the word level of granularity, we can easily recognize that the translation of the first instance is not faithful, as the Chinese word “shijie” (world) is not translated at all. In the second instance, semantic judgment at the word level is not sufficient, as there is no translation for the single Chinese word “jingji” (economy) or “xuezhe” (scholar). We have to elevate the calculation of semantic similarity to a higher sub-phrase level: “jingji xuezhe” vs. “economists”. This suggests that clues and interactions between the source and target side at multiple levels of granularity should be explored to measure the semantic similarity of bilingual phrases.

In order to capture multi-level clues and interactions, we propose a bidimensional attention based recursive autoencoder (BattRAE). It learns bilingual phrase embeddings according to the strengths of interactions between linguistic items at different levels of granularity (i.e., words, sub-phrases and entire phrases) on the source side and those on the target side. The philosophy behind BattRAE is twofold: 1) phrase embeddings are learned from weighted clues at different levels of granularity; 2) the weights of clues are calculated according to the alignments of linguistic items at different levels of granularity between the source and target side. We introduce a bidimensional attention network to learn the strengths of these alignments. Figure 1 illustrates the overall architecture of the BattRAE model. Specifically:

• First, we adopt recursive autoencoders to generate hierarchical structures of source and target phrases separately. At the same time, we can also obtain embeddings at multiple levels of granularity, i.e., words, sub-phrases and entire phrases, from the generated structures (see Section 2).

• Second, BattRAE projects the representations of linguistic items at different levels of granularity onto an attention space, upon which the alignment strengths of linguistic items from the source and target side are calculated by estimating how well they semantically match. These alignment scores are stored in a bidimensional attention matrix. Over this matrix, we perform row (column)-wise summation and softmax operations to generate attention weights on the source (target) side. The final phrase representations are computed as the weighted sum of their initial embeddings using these attention weights (see Section 3.1).

• Finally, BattRAE projects the bilingual phrase representations onto a common semantic space, and uses a bilinear model to measure their semantic similarity (see Section 3.2).

We train the BattRAE model with a max-margin method, which maximizes the semantic similarity of translation equivalents and minimizes that of non-translation pairs (see Section 3.3).

In order to verify the effectiveness of BattRAE in learning bilingual phrase representations, we incorporate the learned semantic similarity of bilingual phrases as a new feature into SMT for translation selection. We conduct experiments with a state-of-the-art SMT system on large-scale training data. Results on the NIST 2006 and 2008 datasets show that BattRAE achieves significant improvements over baseline methods. Further analysis on the bidimensional attention matrix reveals that BattRAE is able to detect semantically related parts of bilingual phrases and to assign higher weights to these parts than to unrelated parts when constructing the final bilingual phrase embeddings.

Figure 1: Overall architecture for the proposed BattRAE model. We use blue and red colors to indicate the source- and target-related representations or structures respectively. The gray colors indicate real values in the bidimensional attention mechanism.

2 Learning Embeddings at Different Levels of Granularity

We use recursive autoencoders (RAE) to learn initial embeddings at different levels of granularity for our model. Combining two children vectors from the bottom up recursively, RAE is able to generate low-dimensional vector representations for variable-sized sequences. The recursion procedure usually consists of two neural operations: composition and reconstruction.

Composition: Typically, the input to RAE is a list of ordered words in a phrase (x_1, x_2, x_3), each of which is embedded into a d-dimensional continuous vector.[1] In each recursion, RAE selects two neighboring children (e.g., c_1 = x_1 and c_2 = x_2) via some selection criterion, and then composes them into a parent embedding y_1, which can be computed as follows:

  y_1 = f(W^(1) [c_1; c_2] + b^(1))    (1)

where [c_1; c_2] ∈ R^{2d} is the concatenation of c_1 and c_2, W^(1) ∈ R^{d×2d} is a parameter matrix, b^(1) ∈ R^d is a bias term, and f is an element-wise activation function such as tanh(·), which is used in our experiments.

Reconstruction: After the composition, we obtain the representation for the parent y_1, which is also a d-dimensional vector. In order to measure how well the parent y_1 represents its children, we reconstruct the original child nodes via a reconstruction layer:

  [c_1'; c_2'] = f(W^(2) y_1 + b^(2))    (2)

where c_1' and c_2' are the reconstructed children, W^(2) ∈ R^{2d×d} and b^(2) ∈ R^{2d}. The minimum Euclidean distance between [c_1'; c_2'] and [c_1; c_2] is usually used as the selection criterion during composition.

These two standard processes form the basic procedure of RAE, which repeats until the embedding of the entire phrase is generated. In addition to phrase embeddings, RAE also constructs a binary tree. The structure of the tree is determined by the selection criterion used in composition. As generating the optimal binary tree for a phrase is usually intractable, we employ a greedy algorithm (Socher et al. 2011b) based on the following reconstruction error:

  E_rec(x) = Σ_{y ∈ T(x)} (1/2) ||[c_1; c_2]_y − [c_1'; c_2']_y||^2    (3)

Parameters W^(1) and W^(2) are thereby learned to minimize the sum of reconstruction errors at each intermediate node y in the binary tree T(x). For more details, we refer the readers to (Socher et al. 2011b).

Given a binary tree learned by RAE, we regard each level of the tree as a level of granularity. In this way, we can use RAE to produce embeddings of linguistic expressions at different levels of granularity. Unfortunately, RAE is unable to synthesize embeddings across different levels of granularity, which will be discussed in the next section.

[1] Generally, all these word vectors are stacked into a word embedding matrix L ∈ R^{d×|V|}, where |V| is the size of the vocabulary.

Additionally, as illustrated in Figure 1, the RAEs for the source and target language are learned separately. In our model, we assume that phrase embeddings for different languages come from different semantic spaces. To make this clear, we denote the dimensions of the source and target phrase embeddings as d_s and d_t respectively.
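To make the composition and reconstruction operations above more concrete, the following Python/NumPy sketch builds a phrase tree greedily and collects the node embeddings at every level of granularity (words, sub-phrases, the entire phrase). It is only an illustration of Eqs. (1)-(3) under assumed parameter names, shapes and initialization; it is not the authors' released implementation.

```python
import numpy as np


class GreedyRAE:
    """Minimal sketch of the greedy recursive autoencoder of Section 2.

    W1/b1 compose two children into a parent (Eq. 1); W2/b2 reconstruct the
    children from the parent (Eq. 2). Names and initialization are assumed.
    """

    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.01, (d, 2 * d))   # W^(1) in R^{d x 2d}
        self.b1 = np.zeros(d)                         # b^(1)
        self.W2 = rng.normal(0.0, 0.01, (2 * d, d))   # W^(2) in R^{2d x d}
        self.b2 = np.zeros(2 * d)                     # b^(2)

    def compose(self, c1, c2):
        children = np.concatenate([c1, c2])
        y = np.tanh(self.W1 @ children + self.b1)     # Eq. (1): parent embedding
        rec = np.tanh(self.W2 @ y + self.b2)          # Eq. (2): reconstructed children
        err = 0.5 * np.sum((children - rec) ** 2)     # one term of Eq. (3)
        return y, err

    def build(self, word_vecs):
        """Greedily merge the adjacent pair with the smallest reconstruction
        error; return embeddings at all levels of granularity plus E_rec."""
        nodes = [np.asarray(v, dtype=float) for v in word_vecs]
        all_embeddings = list(nodes)        # word-level clues
        total_err = 0.0
        while len(nodes) > 1:
            cands = [self.compose(nodes[i], nodes[i + 1])
                     for i in range(len(nodes) - 1)]
            i = int(np.argmin([e for _, e in cands]))
            parent, err = cands[i]
            nodes[i:i + 2] = [parent]       # replace the pair by its parent
            all_embeddings.append(parent)   # sub-phrase / phrase-level clue
            total_err += err
        return all_embeddings, total_err
```

Stacking the returned list column-wise (e.g., np.stack(embeddings, axis=1)) yields exactly the kind of multi-granularity matrix (M_s or M_t) that the bidimensional attention network of Section 3.1 consumes.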

3 Bidimensional Attention-Based Recursive Autoencoders

In this section, we present the proposed BattRAE model. We first elaborate the bidimensional attention network, and then the semantic similarity model built on phrase embeddings learned with the attention network. Finally, we introduce the objective function and training procedure.

Figure 2: An illustration of the bidimensional attention network in the BattRAE model. The gray circles represent the attention space. We use subscripts s and t to indicate the source and target respectively.

3.1 Bidimensional Attention Network

As mentioned in Section 1, we would like to incorporate clues and interactions at multiple levels of granularity into phrase embeddings and further into the semantic similarity model of bilingual phrases. The clues are encoded in the multi-level embeddings learned by the RAEs. The interactions between linguistic items on the source and target side can be measured by how well they semantically match. In order to jointly model clues and interactions at multiple levels of granularity, we propose the bidimensional attention network, which is illustrated in Figure 2.

We take the bilingual phrase (“dui jingji xuezhe”, “to economists”) in Table 1 as an example. Let's suppose that their phrase structures learned by RAE are “(dui, (jingji, xuezhe))” and “(to, economists)” respectively. We perform a postorder traversal on these structures to extract the embeddings of words, sub-phrases and the entire source/target phrase. We treat each embedding as a column and put them together to form a matrix M_s ∈ R^{d_s×n_s} on the source side and M_t ∈ R^{d_t×n_t} on the target side. Here, n_s = 5 and M_s contains embeddings of the linguistic items (“dui”, “jingji”, “xuezhe”, “jingji xuezhe”, “dui jingji xuezhe”) at different levels of granularity. Similarly, n_t = 3 and M_t contains embeddings of (“to”, “economists”, “to economists”). M_s and M_t form the input layer for the bidimensional attention network.
We further stack an attention layer upon the input matrices to project the embeddings from M_s and M_t onto a common attention space as follows (see the gray circles in Figure 2):

  A_s = f(W^(3) M_s + b^A_[:])    (4)

  A_t = f(W^(4) M_t + b^A_[:])    (5)

where W^(3) ∈ R^{d_a×d_s} and W^(4) ∈ R^{d_a×d_t} are transformation matrices, b^A ∈ R^{d_a} is the bias term, and d_a is the dimensionality of the attention space. The subscript [:] indicates a broadcasting operation. Note that we use different transformation matrices but share the same bias term for A_s and A_t. This will force our model to learn to encode attention semantics into the two transformation matrices, rather than into the bias term.

On this attention space, each embedding from the source side is able to interact with all embeddings from the target side and vice versa. The strength of such an interaction can be measured by a semantic matching score, which is calculated via the following equation:

  B_{i,j} = g(A_{s,i}^T A_{t,j})    (6)

where B_{i,j} ∈ R is the score that measures how well the i-th column embedding in A_s semantically matches the j-th column embedding in A_t, and g(·) is a non-linear function. All matching scores form a matrix B ∈ R^{n_s×n_t}, which we call the bidimensional attention matrix. Intuitively, this matrix is the result of handshakes between the source and target phrase at multiple levels of granularity.

Given the bidimensional attention matrix, our next interest lies in how important an embedding at a specific level of granularity is to the semantic similarity between the corresponding source and target phrase. As each embedding interacts with all embeddings on the other side, its importance can be measured as the summation of the strengths of all these interactions, i.e., the matching scores computed in Eq. (6). This can be done via a row/column-wise summation operation over the bidimensional attention matrix as follows:

  ã_{s,i} = Σ_j B_{i,j},   ã_{t,j} = Σ_i B_{i,j}    (7)

where ã_s ∈ R^{n_s} and ã_t ∈ R^{n_t} are the matching score vectors.

Because the length of a phrase is uncertain, we apply a Softmax operation on ã_s and ã_t to keep their values at the same magnitude: a_s = Softmax(ã_s), a_t = Softmax(ã_t). This forces a_s and a_t to become real-valued distributions on the attention space. We call them attention weights (see Figure 2).
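The following sketch restates Eqs. (4)-(7) in NumPy. The choice of g (here tanh, like f) is an assumption, since the paper only states that g is a non-linear function; everything else follows the shapes given above, and the function names are ours.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def bidimensional_attention(Ms, Mt, W3, W4, bA, f=np.tanh, g=np.tanh):
    """Sketch of the bidimensional attention network, Eqs. (4)-(7).

    Ms: d_s x n_s source matrix, Mt: d_t x n_t target matrix (columns are
    word / sub-phrase / phrase embeddings). W3: d_a x d_s, W4: d_a x d_t,
    bA: d_a (shared bias). The choice of g is an assumption of this sketch.
    """
    As = f(W3 @ Ms + bA[:, None])   # Eq. (4): project source onto attention space
    At = f(W4 @ Mt + bA[:, None])   # Eq. (5): project target onto attention space
    B = g(As.T @ At)                # Eq. (6): n_s x n_t bidimensional attention matrix
    a_s = softmax(B.sum(axis=1))    # Eq. (7) + Softmax: source attention weights
    a_t = softmax(B.sum(axis=0))    # Eq. (7) + Softmax: target attention weights
    return a_s, a_t, B
```

The returned weights a_s and a_t are then used to pool the columns of M_s and M_t into the final phrase representations, as described next.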

An important feature of this attention mechanism is that it naturally deals with variable-length bilingual inputs (as we do not impose any length constraints on n_s and n_t at all).

To obtain the final bilingual phrase representations, we convolute the embeddings in the phrase structures with the computed attention weights:

  p_s = Σ_i a_{s,i} M_{s,i},   p_t = Σ_j a_{t,j} M_{t,j}    (8)

This ensures that the generated phrase representations encode weighted clues and interactions at multiple levels of granularity between the source and target phrase. Notice that p_s ∈ R^{d_s} and p_t ∈ R^{d_t} still lie in their language-specific vector spaces.

3.2 Semantic Similarity

To measure the semantic similarity for a bilingual phrase, we first transform the learned bilingual phrase representations p_s and p_t into a common semantic space through a non-linear projection as follows:

  s_s = f(W^(5) p_s + b^s)    (9)

  s_t = f(W^(6) p_t + b^s)    (10)

where W^(5) ∈ R^{d_sem×d_s}, W^(6) ∈ R^{d_sem×d_t} and b^s ∈ R^{d_sem} are the parameters. Similar to the transformation in Eqs. (4) and (5), we share the same bias term for both s_s and s_t.

We then use a bilinear model to compute the semantic similarity score as follows:

  s(f, e) = s_s^T S s_t    (11)

where f and e are the source and target phrase respectively, and s(·, ·) represents the semantic similarity function. S ∈ R^{d_sem×d_sem} is a square matrix of parameters to be learned. We choose this model because the matrix S actually represents an interaction between s_s and s_t, which is desired for our purpose.
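A minimal sketch of Eqs. (8)-(11), continuing the previous snippet: the attention weights pool the multi-granularity embeddings into p_s and p_t, which are projected into the common semantic space and scored with the bilinear form. Parameter names are again illustrative, not taken from the released code.

```python
import numpy as np


def semantic_similarity(Ms, Mt, a_s, a_t, W5, W6, bs, S, f=np.tanh):
    """Sketch of Eqs. (8)-(11): attention-weighted pooling, projection into
    the common semantic space, and the bilinear similarity score.

    W5: d_sem x d_s, W6: d_sem x d_t, bs: d_sem (shared bias), S: d_sem x d_sem.
    """
    p_s = Ms @ a_s               # Eq. (8): weighted sum of source embeddings
    p_t = Mt @ a_t               # Eq. (8): weighted sum of target embeddings
    s_s = f(W5 @ p_s + bs)       # Eq. (9)
    s_t = f(W6 @ p_t + bs)       # Eq. (10): same bias term as Eq. (9)
    return float(s_s @ S @ s_t)  # Eq. (11): bilinear score s(f, e)
```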
3.3 Objective and Training

There are two kinds of errors involved in our objective function: the reconstruction error (see Eq. (3)) and the semantic error. The latter error measures how well a source phrase semantically matches its counterpart target phrase. We employ a max-margin method to estimate this semantic error. Given a training instance (f, e) with negative samples (f^-, e^-), we define the following ranking-based error:

  E_sem(f, e) = max(0, 1 + s(f, e^-) − s(f, e)) + max(0, 1 + s(f^-, e) − s(f, e))    (12)

Intuitively, minimizing this error will maximize the semantic similarity of the correct translation pair (f, e) and minimize (up to a margin) the similarity of the negative translation pairs (f^-, e) and (f, e^-). In order to generate the negative samples, we replace words in a correct translation pair with random words, which is similar to the sampling method used by Zhang et al. (2014).

For each training instance (f, e), the joint objective of BattRAE is defined as follows:

  J(θ) = α E_rec(f, e) + β E_sem(f, e) + R(θ)    (13)

where E_rec(f, e) = E_rec(f) + E_rec(e), the parameters α and β (α + β = 1) are used to balance the preference between the two errors, and R(θ) is the regularization term. We divide the parameters θ into four different groups:[2]

1. θ_L: the word embedding matrices L_s and L_t;
2. θ_rec: the parameters for the RAEs, W_s^(1), W_t^(1), W_s^(2), W_t^(2) and b_s^(1), b_t^(1), b_s^(2), b_t^(2);
3. θ_att: the parameters for the projection of the input matrices onto the attention space, W^(3), W^(4) and b^A;
4. θ_sem: the parameters for the semantic similarity computation, W^(5), W^(6), S and b^s.

Each parameter group is regularized with a unique weight:

  R(θ) = (λ_L/2)||θ_L||^2 + (λ_rec/2)||θ_rec||^2 + (λ_att/2)||θ_att||^2 + (λ_sem/2)||θ_sem||^2    (14)

where the λ_* are our hyperparameters.

To optimize these parameters, we apply the L-BFGS algorithm, which requires two components: parameter initialization and gradient calculation.

Parameter Initialization: We randomly initialize θ_rec, θ_att and θ_sem according to a normal distribution (μ=0, σ=0.01). With respect to the word embeddings θ_L, we use the toolkit Word2Vec[3] to pretrain them on large-scale unlabeled data. All these parameters are further fine-tuned in our BattRAE model.

Gradient Calculation: We compute the partial gradient for parameter θ_k as follows:

  ∂J/∂θ_k = ∂E_rec(f, e)/∂θ_k + ∂E_sem(f, e)/∂θ_k + λ_k θ_k    (15)

This gradient is fed into the toolkit libLBFGS[4] for parameter updating in our practical implementation.

[2] The subscripts s and t are used to denote the source and the target language.
[3] https://code.google.com/p/word2vec/
[4] http://www.chokkan.org/software/liblbfgs/
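The training criterion can be sketched in the same style. The hinge-loss term mirrors Eq. (12) and the joint objective mirrors Eqs. (13)-(14); the grouping of parameters into dictionaries and the default α = 0.125 (the value reported in Section 4.1) are assumptions of this sketch, not the released code.

```python
import numpy as np


def semantic_error(score, f, e, f_neg, e_neg):
    """Max-margin semantic error of Eq. (12).

    `score` is any callable returning the bilinear similarity s(f, e);
    f_neg / e_neg are negative samples obtained by replacing random words.
    """
    s_pos = score(f, e)
    return (max(0.0, 1.0 + score(f, e_neg) - s_pos)
            + max(0.0, 1.0 + score(f_neg, e) - s_pos))


def joint_objective(E_rec, E_sem, params, lambdas, alpha=0.125):
    """Joint objective of Eqs. (13)-(14) for one training instance.

    `params` and `lambdas` map the group names ('L', 'rec', 'att', 'sem')
    to a flattened parameter vector and its regularization weight.
    """
    beta = 1.0 - alpha                                              # alpha + beta = 1
    reg = sum(0.5 * lambdas[k] * np.sum(params[k] ** 2) for k in params)
    return alpha * E_rec + beta * E_sem + reg
```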

4 Experiment

In order to examine the effectiveness of BattRAE in learning bilingual phrase embeddings, we carried out large-scale experiments on NIST Chinese-English translation tasks.[5]

4.1 Setup

Our parallel corpus consists of 1.25M sentence pairs extracted from LDC corpora,[6] with 27.9M Chinese words and 34.5M English words respectively. We trained a 5-gram language model on the Xinhua portion of the GIGAWORD corpus (247.6M English words) using the SRILM toolkit[7] with modified Kneser-Ney smoothing. We used the NIST MT05 data set as the development set, and the NIST MT06/MT08 datasets as the test sets. We used minimum error rate training (Och and Ney 2003) to optimize the weights of our translation system. We used the case-insensitive BLEU-4 metric (Papineni et al. 2002) to evaluate translation quality and performed paired bootstrap sampling (Koehn 2004) for significance testing.

In order to obtain high-quality bilingual phrases to train the BattRAE model, we used forced decoding (Wuebker, Mauser, and Ney 2010) (but without the leaving-one-out) on the above parallel corpus, and collected 2.8M phrase pairs. From these pairs, we further extracted 34K bilingual phrases as our development data to optimize all hyper-parameters using random search (Bergstra and Bengio 2012). Finally, we set d_s=d_t=d_a=d_sem=50, α=0.125 (such that β=0.875), λ_L=1e-5, λ_rec=λ_att=1e-4 and λ_sem=1e-3 according to experiments on the development data. Additionally, we set the maximum number of iterations in the L-BFGS algorithm to 100.

[5] Source code is available at https://github.com/DeepLearnXMU/BattRAE.
[6] This includes LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06.
[7] http://www.speech.sri.com/projects/srilm/download.html
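For convenience, the hyper-parameter values reported in this subsection can be gathered into a single configuration object; the dictionary below merely restates them (the key names are ours, and β is derived from the constraint α + β = 1).

```python
# Hyper-parameter values reported in Section 4.1 (key names are illustrative).
battrae_config = {
    "d_s": 50, "d_t": 50, "d_a": 50, "d_sem": 50,   # embedding / space dimensions
    "alpha": 0.125, "beta": 0.875,                  # error balancing weights
    "lambda_L": 1e-5,                               # word-embedding regularizer
    "lambda_rec": 1e-4, "lambda_att": 1e-4,         # RAE / attention regularizers
    "lambda_sem": 1e-3,                             # similarity regularizer
    "lbfgs_max_iterations": 100,
}
```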
Method     MT06       MT08      AVG
Baseline   31.55      23.66     27.61
BRAE       32.29      24.30     28.30
BCorrRAE   32.62      24.89     28.76
BattRAE    33.19⇑∗∗↑  25.29⇑∗∗  29.24

Table 2: Experiment results on the MT06/08 test sets. AVG = average BLEU score over the test sets. We highlight the best result in bold. “⇑”: significantly better than Baseline (p < 0.01); “∗∗”: significantly better than BRAE (p < 0.01); “↑”: significantly better than BCorrRAE (p < 0.05).

4.2 Translation Performance

We compared BattRAE against the following three methods:

• Baseline: Our baseline decoder is a state-of-the-art bracketing transduction grammar based translation system with a maximum entropy based reordering model (Wu 1997; Xiong, Liu, and Lin 2006). The features used in this baseline include: rule translation probabilities in two directions, lexical weights in two directions, target-side word number, phrase number, language model score, and the score of the maximum entropy based reordering model.

• BRAE: The neural model proposed by Zhang et al. (2014). We incorporate the semantic distances computed according to BRAE as new features into the log-linear model of SMT for translation selection.

• BCorrRAE: The neural model proposed by Su et al. (2015) that extends BRAE with word alignment information. The structural similarities computed by BCorrRAE are integrated into the Baseline as additional features.

With respect to the two neural baselines BRAE and BCorrRAE, we used the same training data as well as the same methods as ours for hyper-parameter optimization, except for the dimensionality of word embeddings, which we set to 50 in experiments.

Table 2 summarizes the experiment results of BattRAE against the other three methods on the test sets. BattRAE significantly improves translation quality on all test sets in terms of BLEU. In particular, it achieves an improvement of up to 1.63 BLEU points on average over the Baseline. Compared with the two neural baselines, our BattRAE model obtains consistent improvements on all test sets. It significantly outperforms BCorrRAE by 0.48 BLEU points, and BRAE by almost 1 BLEU point on average. We attribute these improvements to the incorporation of clues and interactions at different levels of granularity, since neither BCorrRAE nor BRAE explores them.

4.3 Attention Analysis

Observing the significant improvements and advantages of BattRAE over BRAE and BCorrRAE, we would like to take a deeper look into how the bidimensional attention mechanism works in the BattRAE model. Specifically, we wonder which words are highly weighted by the attention mechanism. Table 3 shows some examples from our translation model. We provide the phrase structures learned by RAE and visualize the attention weights for these examples.

We do find that the BattRAE model is able to learn what is important for semantic similarity computation. The model can recognize the correspondence between “yige” and “same”, “yanzhong” and “serious concern”, “weizhi” and “so far”. These word pairs tend to give high semantic similarity scores to these translation instances. In contrast, because of the incorrect translation pairs “taidu” (attitude) vs. “very critical”, “yinwei” (because) vs. “the part”, and “wenti” (problem) vs. “hong kong”, the model assigns low semantic similarity scores to these negative instances. These results indicate that the BattRAE model is indeed able to detect and focus on the semantically related parts of bilingual phrases.

Further observation reveals that there are strong relations between phrase structures and attention weights. Generally, the BattRAE model will assign high weights to words subsumed by many internal nodes of phrase structures. For example, we find that the correct translation of “wenti” actually appears in the corresponding target phrase. However, due to errors in the learned phrase structures, the model fails to detect this translation. Instead, it finds an incorrect translation, “hong kong”. This suggests that the quality of the learned phrase structures has an important impact on the performance of our model.

Type   Phrase Structure (Src)       Phrase Structure (Tgt)            Attention Visualization (Src)   Attention Visualization (Tgt)
Good   ((yi, ge), zhongguo)         (to, (the, (same, china)))        yi ge zhongguo                  to the same china
       ((yanzhong, de), shi)        (((serious, concern), is), the)   yanzhong de shi                 serious concern is the
       (jiezhi, (muqian, weizhi))   ((so, far), (this, year))         jiezhi muqian weizhi            so far this year
Bad    (yanjin, (de, taidu))        ((be, (very, critical)), of)      yanjin de taidu                 be very critical of
       (zhuyao, (shi, yinwei))      ((on, (the, part)), of)           zhuyao shi yinwei               on the part of
       (cunzai, (de, wenti))        (problems, (of, (hong, kong)))    cunzai de wenti                 problems of hong kong

Table 3: Examples of bilingual phrases from our translation model with both phrase structures and attention visualization. For each example, important words are highlighted in dark red (the highest attention weight), red (the second highest) and light red (the third highest) according to their attention weights. Good = good translation pair, Bad = bad translation pair, judged according to their semantic similarity scores.

5 Related Work

Our work is related to bilingual embeddings and attention-based neural networks. We introduce previous work on these two lines in this section.

5.1 Bilingual Embeddings

The studies on bilingual embeddings start from bilingual word embedding learning. Zou et al. (2013) use word alignments to connect embeddings of source and target words. To alleviate the reliance of bilingual embedding learning on parallel corpora, Vulić and Moens (2015) explore document-aligned instead of sentence-aligned data, while Gouws et al. (2015) investigate monolingual raw texts. Different from these corpus-centered methods, Kočiský et al. (2014) develop a probabilistic model to capture deep semantic information, while Chandar et al. (2014) test the use of autoencoder-based methods. More recently, Luong et al. (2015b) jointly model context co-occurrence information and meaning-equivalent signals to learn high-quality bilingual embeddings.

As phrases have long been used as the basic translation units in SMT, bilingual phrase embeddings attract increasing interest. Since translation equivalents share the same semantic meaning, embeddings of source/target phrases can be learned with information from their counterparts. Along this line, a variety of neural models have been explored: multi-layer perceptrons (Gao et al. 2014), RNN encoder-decoders (Cho et al. 2014) and recursive autoencoders (Zhang et al. 2014; Su et al. 2015).

The most related work to ours is the bilingual recursive autoencoders (Zhang et al. 2014; Su et al. 2015). Zhang et al. (2014) represent bilingual phrases with embeddings of root nodes in bilingual RAEs, which are learned subject to transformation and distance constraints on the source and target language. Su et al. (2015) extend the model of Zhang et al. (2014)
by exploring word alignments and correspondences inside source and target phrases. A major limitation of their models is that they are not able to incorporate clues at multiple levels of granularity to learn bilingual phrase embeddings, which exactly forms our basic motivation.

5.2 Attention-Based Neural Networks

Over the last few months, we have seen the tremendous success of attention-based neural networks in a variety of tasks, where learning alignments between different modalities is a key interest. For example, Mnih et al. (2014) learn image objects and agent actions in a dynamic control problem. Xu et al. (2015) exploit an attentional mechanism in the task of image caption generation. With respect to neural machine translation, Bahdanau et al. (2014) succeed in jointly learning to translate and align words. Luong et al. (2015a) further evaluate different attention architectures for translation. Inspired by these works, we propose a bidimensional attention network that is suitable for the bilingual context.

In addition to the abovementioned neural models, our model is also related to the work of Socher et al. (2011a) and Yin and Schütze (2015) in terms of multi-granularity embeddings. The former preserves multi-granularity embeddings in tree structures and introduces a dynamic pooling technique to extract features directly from an attention matrix. The latter extends the idea of the former model to convolutional neural networks. Significantly different from their models, we introduce a bidimensional attention matrix to generate attention weights, instead of extracting features. Additionally, our attention-based model is also significantly different from tensor networks (Socher et al. 2013) in that the attention-based model is able to handle variable-length inputs, while tensor networks often assume fixed-length inputs.

We notice that the very recently proposed attentive pooling model (dos Santos et al. 2016) also aims at modeling mutual interactions between two inputs, with a two-way attention mechanism that is similar to ours. The major differences between their work and ours lie in the following four aspects. First, we perform a transformation ahead of attention computation in order to deal with language divergences, rather than directly computing the attention matrix. Second, we calculate attention weights via a sum-pooling approach, instead of max pooling, in order to preserve all interactions at each level of granularity. Third, we apply our bidimensional attention technique to recursive autoencoders instead of convolutional neural networks. Last, we aim at learning bilingual phrase representations rather than question answering. Most importantly, our work and theirs can be seen as two independently developed models that provide different perspectives on a new attention mechanism.

6 Conclusion and Future Work

In this paper, we have presented a bidimensional attention based recursive autoencoder to learn bilingual phrase representations. The model incorporates clues and interactions across source and target phrases at multiple levels of granularity. Through the bidimensional attention network, our model is able to integrate them into bilingual phrase embeddings. Experiment results show that our approach significantly improves translation quality.

In the future, we would like to exploit different functions to compute the semantic matching scores (Eq. (6)), and other neural models for the generation of phrase structures. Additionally, the bidimensional attention mechanism can be applied to convolutional neural networks and recurrent neural networks. Furthermore, we are also interested in adapting our model to semantic tasks such as paraphrase identification and natural language inference.

Acknowledgements

The authors were supported by the National Natural Science Foundation of China (Grant Nos. 61303082, 61403269, 61622209 and 61672440), the Natural Science Foundation of Fujian Province (Grant No. 2016J05161), and the Natural Science Foundation of Jiangsu Province (Grant No. BK20140355). We also thank the anonymous reviewers for their insightful comments.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
Bergstra, J., and Bengio, Y. 2012. Random search for hyper-parameter optimization. JMLR 281–305.
Chandar A P, S.; Lauly, S.; Larochelle, H.; Khapra, M.; Ravindran, B.; Raykar, V. C.; and Saha, A. 2014. An autoencoder approach to learning bilingual word representations. In Proc. of NIPS, 1853–1861.
Chiang, D. 2007. Hierarchical phrase-based translation. Computational Linguistics 33:201–228.
Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. of EMNLP, 1724–1734.
dos Santos, C.; Tan, M.; Xiang, B.; and Zhou, B. 2016. Attentive pooling networks. ArXiv e-prints.
Gao, J.; He, X.; Yih, W.-t.; and Deng, L. 2014. Learning continuous phrase representations for translation modeling. In Proc. of ACL, 699–709.
Gouws, S.; Bengio, Y.; and Corrado, G. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proc. of ICML, 748–756.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In Proc. of NAACL, 48–54.
Koehn, P. 2004. Statistical significance tests for machine translation evaluation. In Proc. of EMNLP, 388–395.
Kočiský, T.; Hermann, K. M.; and Blunsom, P. 2014. Learning bilingual word representations by marginalizing alignments. In Proc. of ACL, 224–229.
Luong, M.; Pham, H.; and Manning, C. D. 2015a. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP, 1412–1421.
Luong, T.; Pham, H.; and Manning, C. D. 2015b. Bilingual word representations with monolingual quality in mind. In Proc. of VSM-NLP, 151–159.
Mnih, V.; Heess, N.; Graves, A.; and Kavukcuoglu, K. 2014. Recurrent models of visual attention. In Proc. of NIPS, 2204–2212.
Och, F. J., and Ney, H. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29:19–51.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL, 311–318.
Socher, R.; Huang, E. H.; Pennin, J.; Manning, C. D.; and Ng, A. Y. 2011a. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proc. of NIPS, 801–809.
Socher, R.; Pennington, J.; Huang, E. H.; Ng, A. Y.; and Manning, C. D. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proc. of EMNLP, 151–161.
Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for knowledge base completion. In Proc. of NIPS, 926–934.
Su, J.; Xiong, D.; Zhang, B.; Liu, Y.; Yao, J.; and Zhang, M. 2015. Bilingual correspondence recursive autoencoder for statistical machine translation. In Proc. of EMNLP, 1248–1258.
Vulić, I., and Moens, M.-F. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proc. of ACL-IJCNLP, 719–725.
Wu, D. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics 23(3).
Wuebker, J.; Mauser, A.; and Ney, H. 2010. Training phrase translation models with leaving-one-out. In Proc. of ACL, 475–484.
Xiong, D.; Liu, Q.; and Lin, S. 2006. Maximum entropy based phrase reordering model for statistical machine translation. In Proc. of ACL, 521–528.
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A. C.; Salakhutdinov, R.; Zemel, R. S.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proc. of ICML, 2048–2057.
Yin, W., and Schütze, H. 2015. MultiGranCNN: An architecture for general matching of text chunks on multiple levels of granularity. In Proc. of ACL-IJCNLP, 63–73.
Zhang, J.; Liu, S.; Li, M.; Zhou, M.; and Zong, C. 2014. Bilingually-constrained phrase embeddings for machine translation. In Proc. of ACL, 111–121.
Zou, W. Y.; Socher, R.; Cer, D.; and Manning, C. D. 2013. Bilingual word embeddings for phrase-based machine translation. In Proc. of EMNLP, 1393–1398.
