Joint Language and Translation Modeling with Recurrent Neural Networks

Michael Auli, Michel Galley, Chris Quirk, Geoffrey Zweig
Microsoft Research, Redmond, WA, USA
{michael.auli,mgalley,chrisq,gzweig}@microsoft.com

Abstract

We present a joint language and translation model based on a recurrent neural network which predicts target words based on an unbounded history of both source and target words. The weaker independence assumptions of this model result in a vastly larger search space compared to related feed-forward-based language or translation models. We tackle this issue with a new lattice rescoring algorithm and demonstrate its effectiveness empirically. Our joint model builds on a well known recurrent neural network language model (Mikolov, 2012) augmented by a layer of additional inputs from the source language. We show competitive accuracy compared to the traditional channel model features. Our best results improve the output of a system trained on WMT 2012 French-English data by up to 1.5 BLEU, and by 1.1 BLEU on average across several test sets.

1 Introduction

Recently, several feed-forward neural network-based language and translation models have achieved impressive accuracy improvements on statistical machine translation tasks (Allauzen et al., 2011; Le et al., 2012b; Schwenk et al., 2012). In this paper we focus on recurrent neural network architectures, which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011a; Mikolov, 2012), outperforming multi-layer feed-forward networks in both perplexity and word error rate in speech recognition (Arisoy et al., 2012; Sundermeyer et al., 2013). The major attraction of recurrent architectures is their potential to capture long-span dependencies, since predictions are based on an unbounded history of previous words. This is in contrast to feed-forward networks as well as conventional n-gram models, both of which are limited to fixed-length contexts.

Building on the success of recurrent architectures, we base our joint language and translation model on an extension of the recurrent neural network language model (Mikolov and Zweig, 2012) that introduces a layer of additional inputs (§2).

Most previous work on neural networks for language modeling or translation used a rescoring setup based on n-best lists (Arisoy et al., 2012; Mikolov, 2012) for evaluation, thereby side-stepping the algorithmic and engineering challenges of direct decoder integration.¹ Instead, we exploit lattices, which offer a much richer representation of the decoder output, since they compactly encode an exponential number of translation hypotheses in polynomial space. In contrast, n-best lists are typically very redundant, representing only a few combinations of top scoring arcs in the lattice. A major challenge in lattice rescoring with a recurrent neural network model is the effect of the unbounded history on search, since the usual dynamic programming assumptions which are exploited for efficiency no longer hold. We apply a novel algorithm to the task of rescoring with an unbounded language model and empirically demonstrate its effectiveness (§3).

The algorithm proves robust, leading to significant improvements with the recurrent neural network language model over a competitive n-gram baseline across several language pairs. We even observe consistent gains when pairing the model with a large n-gram model trained on up to 575 times more data, demonstrating that the model provides complementary information (§4).

¹One notable exception is Le et al. (2012a) who rescore reordering lattices with a feed-forward network-based model.
Our joint modeling approach is based on adding a continuous space representation of the foreign sentence as an additional input to the recurrent neural network language model. With this extension, the language model can measure the consistency between the source and target words in a context-sensitive way. The model effectively combines the functionality of both the traditional channel and language model features. We test the power of this new model by using it as the only source of traditional channel information. Overall, we find that the model achieves accuracy competitive with the older channel model features and that it can improve over the gains observed with the recurrent neural network language model (§5).

Figure 1: Structure of the recurrent neural network model, including the auxiliary input layer ft.

2 Model Structure

We base our model on the recurrent neural network language model of Mikolov et al. (2010) which is factored into an input layer, a hidden layer with recurrent connections, and an output layer (Figure 1). The input layer encodes the target language word at time t as a 1-of-N vector e_t, where |V| is the size of the vocabulary, and the output layer y_t represents a probability distribution over target words; both are of size |V|. The hidden layer state h_t encodes the history of all words observed in the sequence up to time step t. This model is extended by an auxiliary input layer f_t which provides complementary information to the input layer (Mikolov and Zweig, 2012). While the auxiliary input layer can be used to feed in arbitrary additional information, we focus on encodings of the foreign sentence (§5).

The state of the hidden layer is determined by the input layer, the auxiliary input layer and the hidden layer configuration of the previous time step h_{t−1}. The weights of the connections between the layers are summarized in a number of matrices: U, F and W represent weights from the input layer to the hidden layer, from the auxiliary input layer to the hidden layer, and from the previous hidden layer to the current hidden layer, respectively. Matrix V represents connections between the current hidden layer and the output layer; G represents direct weights between the auxiliary input and output layers. The hidden and output layers are computed via a series of matrix-vector products and non-linearities:

    h_t = s(U e_t + W h_{t−1} + F f_t)
    y_t = g(V h_t + G f_t)

where

    s(z) = 1 / (1 + exp{−z}),    g(z_m) = exp{z_m} / Σ_k exp{z_k}

are sigmoid and softmax functions, respectively.

Additionally, the network is interpolated with a maximum entropy model of sparse n-gram features over input words (Mikolov et al., 2011a).² The maximum entropy weights are added to the output activations before computing the softmax.

The model is optimized via a maximum likelihood objective function using stochastic gradient descent. Training is based on the back propagation through time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986). After training, the output layer represents posteriors p(e_{t+1} | e^t_{t−n+1}, h_t, f_t), the probability distribution over words in the output vocabulary given the n previous input words e^t_{t−n+1}, the hidden layer configuration h_t, as well as the auxiliary input layer configuration f_t.

²While these features depend on multiple input words, we depicted them for simplicity as a connection between the current input word vector e_t and the output layer (D).

Naïve computation of the probability distribution over the next word is very expensive for large vocabularies. A well established efficiency trick uses word-classing to create a more efficient two-step process (Goodman, 2001; Emami and Jelinek, 2005; Mikolov et al., 2011b) where each word is assigned a unique class. To compute the probability of a word, we first compute the probability of its class, and then multiply it by the probability of the word conditioned on the class:

    p(e_{t+1} | e^t_{t−n+1}, h_t, f_t) = p(c_i | e^t_{t−n+1}, h_t, f_t) × p(e_{t+1} | c_i, e^t_{t−n+1}, h_t, f_t)

This factorization reduces the complexity of computing the output probabilities from O(|V|) to O(|C| + max_i |c_i|), where |C| is the number of classes and |c_i| the number of words in class c_i.

When rescoring with an n-gram model, we have to maintain at least two words at each state, also known as the n-gram context.

1:  function RESCORELATTICE(k, V, E, s, T)
2:    Q ← TOPOLOGICALLY-SORT(V)
3:    for all v in V do                      ▷ Heaps of split-states
4:      H_v ← MINHEAP()
5:    end for
6:    h_0 ← 0                                ▷ Initialize start-state (zero vector)
7:    H_s.ADD(h_0)
8:    for all v in Q do                      ▷ Examine outgoing arcs
9:      for ⟨v, x⟩ in E do
10:       for h in H_v do                    ▷ Extend LM states
11:         h′ ← SCORERNN(h, phrase(h))
12:         parent(h′) ← h                   ▷ Backpointers
13:         if H_x.size() ≥ k ∧              ▷ Beam width
14:           H_x.MIN()
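To make the model equations and the state extension used in the listing above concrete, the following is a minimal NumPy sketch of one forward step with the auxiliary input layer and the class-factored output. The function rnn_step corresponds loosely to the state update a routine such as SCORERNN would perform; all dimensions, the class assignment, the random initialization, and the omission of the maximum entropy features are illustrative assumptions, not the implementation used in the experiments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(params, w_t, h_prev, f_t):
    """One step of the recurrent LM with auxiliary input f_t.
    Returns the new hidden state h_t = s(U e_t + W h_{t-1} + F f_t);
    e_t is 1-of-N, so U e_t reduces to a column lookup."""
    U, W, F = params["U"], params["W"], params["F"]
    return sigmoid(U[:, w_t] + W @ h_prev + F @ f_t)

def word_prob(params, classes, word2class, w_next, h_t, f_t):
    """Class-factored output: p(w) = p(class(w) | history) * p(w | class(w), history)."""
    Vw, Gw = params["Vw"], params["Gw"]      # word output weights
    Vc, Gc = params["Vc"], params["Gc"]      # class output weights
    c = word2class[w_next]
    members = classes[c]                     # word ids belonging to class c
    p_class = softmax(Vc @ h_t + Gc @ f_t)[c]
    in_class = softmax(Vw[members] @ h_t + Gw[members] @ f_t)
    return p_class * in_class[members.index(w_next)]

# Toy usage with random weights (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
V_size, H, D, C = 10, 8, 4, 2                # vocab, hidden, auxiliary, classes
params = {
    "U": rng.normal(size=(H, V_size)), "W": rng.normal(size=(H, H)),
    "F": rng.normal(size=(H, D)),
    "Vw": rng.normal(size=(V_size, H)), "Gw": rng.normal(size=(V_size, D)),
    "Vc": rng.normal(size=(C, H)),      "Gc": rng.normal(size=(C, D)),
}
word2class = [0] * 5 + [1] * 5               # first half of the vocabulary in class 0
classes = {0: list(range(5)), 1: list(range(5, 10))}
h = np.zeros(H)
f = rng.normal(size=D)                       # stand-in for the source-sentence encoding
for w, w_next in [(1, 4), (4, 7)]:
    h = rnn_step(params, w, h, f)
    print(word_prob(params, classes, word2class, w_next, h, f))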

Table 2: French-English results when rescoring with the recurrent neural network language model; the baseline relies on an n-gram model trained on 1.15bn words.

                dev    news2010   news2011   newssyscomb2011   Avg(test)
Baseline        21.2   20.7       19.2       20.6              20.0
+RNNLM (2m)     21.8   20.9       19.4       20.9              20.3
+RNNLM (50m)    22.1   21.1       19.7       21.0              20.5

Table 3: German-English results when rescoring with the recurrent neural network language model.

                dev    news2010   news2011   newssyscomb2011   Avg(test)
Baseline        15.2   15.6       14.3       15.7              15.1
+RNNLM (2m)     15.7   15.9       14.6       16.0              15.4
+RNNLM (50m)    15.8   15.9       14.7       16.1              15.5

Table 4: English-German results when rescoring with the recurrent neural network language model; the baseline relies on an n-gram model trained on 327m words.

Finally, we train a transform between the source words and the reference representations. This leads to the best results, improving 1.5 BLEU over the 1-best decoder output and adding 0.2 BLEU on average to the gains achieved by the recurrent language model (§5.4).

Setup. Conventional language models can be trained on monolingual or bilingual data; however, the joint model can only be trained on the latter. In order to control for data size effects, we restrict training of all models, including the baseline n-gram model, to the target side of the parallel corpus, about 102m words for French-English. Furthermore, we train recurrent models only on the newswire portion (about 3.5m words for training and 250k words for validation), since initial experiments showed comparable results to using the full parallel corpus, which is available to the baseline. This is reasonable since the test data is newswire. Also, it allows for more rapid experimentation.

5.1 Foreign Sentence Representations

We represent foreign sentences either by latent semantic analysis (LSA; Deerwester et al. 1990) or by word encodings produced as a by-product of training the recurrent neural network language model on the source words.

LSA is widely used for representing words and documents in low-dimensional vector space. The method applies reduced singular value decomposition (SVD) to a matrix M of word counts; in our setting, rows represent sentences and columns represent foreign words. SVD reduces the number of columns while preserving similarity among the rows, effectively mapping from a high-dimensional representation of a sentence, as a set of words, to a low-dimensional set of concepts. The output of SVD is an approximation of M by three matrices: T contains single word representations, R represents full sentences, and S is a diagonal scaling matrix:

    M ≈ T S Rᵀ

Given vocabulary V and n sentences, we construct M as a matrix of size |V| × n. The ij-th entry is the number of times word i occurs in sentence j, also known as the term frequency value; the entry is also weighted by the inverse document frequency, the relative importance of word i among all sentences, expressed as the negative logarithm of the fraction of sentences in which word i occurs.

As a second representation, we use single word embeddings implicitly learned by the input layer weights U of the recurrent neural network language model (§2), denoted as RNN. Each word is represented by a vector of size |h_t|, the number of neurons in the hidden layer; in our experiments, we consider concatenations of individual word vectors to represent foreign word contexts. These encodings have previously been found to capture syntactic and semantic regularities (Mikolov et al., 2013) and are readily available in our experimental framework via training a recurrent neural network language model on the source side of the parallel corpus.
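As an illustration of this construction, the following is a minimal sketch of the LSA sentence encoding described above: a |V| × n term frequency matrix weighted by inverse document frequency, followed by a truncated SVD. The toy corpus, the dimensionality k, and the exact weighting details are illustrative assumptions rather than the preprocessing used for the experiments.

import numpy as np

def lsa_sentence_vectors(sentences, k=2):
    """Build a |V| x n tf-idf matrix M (rows: words, columns: sentences) and
    reduce it with a truncated SVD, M ~= T S R^T, as described in Sec. 5.1.
    Returns word vectors T (|V| x k), singular values S, sentence vectors R (n x k)."""
    vocab = sorted({w for s in sentences for w in s})
    word_id = {w: i for i, w in enumerate(vocab)}
    n = len(sentences)

    # Term frequencies.
    M = np.zeros((len(vocab), n))
    for j, s in enumerate(sentences):
        for w in s:
            M[word_id[w], j] += 1.0

    # Inverse document frequency: -log(fraction of sentences containing word i).
    df = (M > 0).sum(axis=1)
    idf = -np.log(df / n)
    # Guard against words occurring in every sentence (idf would be exactly 0).
    M = M * np.maximum(idf, 1e-6)[:, None]

    # Reduced SVD keeping the top-k singular values.
    T_full, s_full, Rt_full = np.linalg.svd(M, full_matrices=False)
    return T_full[:, :k], s_full[:k], Rt_full[:k, :].T, word_id

# Toy usage on whitespace-tokenized "foreign" sentences (illustrative only).
corpus = [s.split() for s in ["le chat dort", "le chien dort", "la maison est grande"]]
T, S, R, word_id = lsa_sentence_vectors(corpus, k=2)
print(R.shape)        # (3, 2): one 2-dimensional encoding per sentence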
                        −p(e|f)   −p(e|f), −p(f|e)
Baseline without CM     24.0      22.5
+ target-only           24.5      22.6
+ sent-lsa-dim240       24.9      23.3
+ ww-rnn-dim50.n5       24.9      24.0
+ ww-rnn-dim50.p5n5     24.6      23.7
+ ww-rnn-dim50.p3       24.6      22.3
+ ww-rnn-dim50.c5       24.9      24.0
+ ww-lsa-dim50.n5       24.8      23.9
+ ww-lsa-dim50.p3       23.8      23.2

Table 6: Comparison of the joint model and the channel model features (CM) by removing channel features corresponding to −p(e|f) from the lattices, or both directions −p(e|f) and −p(f|e), and replacing them by various joint models. We re-tuned the log-linear weights for the different feature-sets. Accuracy is based on the average BLEU over news2010, newssyscomb2010, and news2011.

5.2 Results

We first experiment with the two previously introduced representations of the source-side sentence. Table 5 shows the results compared to the 1-best decoder output and an RNN language model (target-only). We first try LSA encodings of the entire foreign sentence as 80 or 240 dimensional vectors (sent-lsa-dim80, sent-lsa-dim240). Next, we experiment with single-word RNN representations of sliding word-windows in the hope of representing relevant context more precisely. Word-windows are constructed relative to the source words aligned to the current target word, and individual word vectors are concatenated into a single vector. We first try contexts which do not include the aligned source words, in the hope of capturing information not already modeled by the channel models, starting with the next five words (ww-rnn-dim50.n5), the five previous and the next five words (ww-rnn-dim50.p5n5), as well as the previous three words (ww-rnn-dim50.p3). Next, we experiment with word-windows of up to five aligned source words (ww-rnn-dim50.c5). Finally, we try contexts based on LSA word vectors (ww-lsa-dim50.n5, ww-lsa-dim50.p3).⁵

While all models improve over the baseline, none significantly outperforms the recurrent neural network language model in terms of BLEU. However, the perplexity results suggest that the models utilize the foreign representations, since all joint models improve vastly over the target-only language model. The lowest perplexity is achieved by the context covering the aligned source words (ww-rnn-dim50.c5), since the source words are a better predictor of the target words than outside context.

The experiments so far measured if the joint model can improve in addition to the four channel model features used by the baseline, that is, the maximum likelihood and lexical translation features in both translation directions. The joint model clearly overlaps with these features, but how well does the recurrent model perform compared against the channel model features? To answer this question, we removed the channel model features corresponding to the same translation direction as the joint model, specifically pMLE(e|f) and pLW(e|f), from the lattices and measured the effect of adding the joint models.

The results (Table 6, column −p(e|f)) clearly show that our joint models are competitive with the channel model features, outperforming the original baseline with all channel model features (24.7 BLEU) by 0.2 BLEU (ww-rnn-dim50.n5, ww-rnn-dim50.c5). As a second experiment, we removed all channel model features (column −p(e|f), −p(f|e)), diminishing baseline accuracy to 22.5 BLEU. In this setting, the best joint model is able to make up 1.5 of the 2.2 BLEU lost due to the removal of the channel model features, while modeling only a single translation direction. This setup also shows the negligible effect of the target-only language model in the absence of translation scores, whereas the joint models are much more effective since they do model translation. Overall, the best joint models prove very competitive to the traditional channel features.

⁵We ignore the coverage vector when determining word-windows, which risks including already translated words. Building word-windows based on the coverage vector would require additional state in a rescoring setting meant to be light-weight.
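For concreteness, the following is a minimal sketch, under assumed data structures, of how such a word-window encoding could be assembled: given the source position aligned to the current target word, it concatenates the embeddings of the surrounding source words (here the next five, mirroring the .n5 setting) and zero-pads at sentence boundaries. The embedding lookup and padding scheme are illustrative assumptions, not the exact implementation used in the experiments.

import numpy as np

def window_encoding(src_embeddings, aligned_pos, before=0, after=5):
    """Concatenate embeddings of source words around the aligned position.

    src_embeddings: list of d-dimensional vectors, one per source word (e.g. columns
                    of the input weight matrix U of a source-side RNN LM, or LSA
                    word vectors).
    aligned_pos:    index of the source word aligned to the current target word.
    before, after:  words to take to the left / right of the aligned word;
                    before=0, after=5 corresponds to the ".n5" windows in Sec. 5.2.
    Returns a single vector of size d * (before + after), zero-padded at the edges.
    """
    d = len(src_embeddings[0])
    pieces = []
    for offset in list(range(-before, 0)) + list(range(1, after + 1)):
        i = aligned_pos + offset
        if 0 <= i < len(src_embeddings):
            pieces.append(src_embeddings[i])
        else:
            pieces.append(np.zeros(d))      # pad positions outside the sentence
    return np.concatenate(pieces)

# Toy usage: a 7-word source sentence with random 50-dimensional embeddings.
rng = np.random.default_rng(1)
src = [rng.normal(size=50) for _ in range(7)]
f_t = window_encoding(src, aligned_pos=2, before=0, after=5)
print(f_t.shape)    # (250,): five concatenated 50-dim vectors, as in ww-rnn-dim50.n5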
                      dev    news2010   news2011   newssyscomb2010   Avg(test)   PPL
Baseline              24.3   24.4       25.1       24.3              24.7        341
target-only           25.1   25.1       26.4       25.0              25.6        218
sent-lsa-dim80        25.2   25.2       26.3       25.1              25.6        147
sent-lsa-dim240       25.1   25.0       26.2       24.9              25.4        126
ww-rnn-dim50.n5       24.9   25.0       26.3       24.8              25.4         61
ww-rnn-dim50.p5n5     25.0   24.8       26.2       24.7              25.3         59
ww-rnn-dim50.p3       25.1   25.1       26.5       24.9              25.6        143
ww-rnn-dim50.c5       24.8   24.9       26.0       24.8              25.3         16
ww-lsa-dim50.n5       25.0   25.0       26.2       24.8              25.4         76
ww-lsa-dim50.p3       25.1   25.1       26.5       24.9              25.6        151

Table 5: Translation accuracy of the joint model with various encodings of the foreign sentence measured on the French-English task. Perplexity (PPL) is based on news2011.

                           BLEU   PPL
Baseline                   25.2   341
target-only                26.4   218
oracle (sent-lsa-dim40)    27.7   124
oracle (sent-lsa-dim80)    28.5   103
oracle (sent-lsa-dim160)   29.0    86
oracle (sent-lsa-dim240)   29.5    76

Table 7: Oracle accuracy of the joint model when using an LSA encoding of the references, measured on the news2011 French-English task.

5.3 Oracle Experiment

The previous section examined the effect of a set of basic foreign sentence representations. Although we find some benefit from these representations, the differences are not large. One might naturally ask whether there is greater potential upside from this channel model. Therefore we turn to measuring the upper bound on accuracy for the joint approach as a whole.

Specifically, we would like to find a bound on accuracy given an ideal representation of the source sentence. To answer this question, we conducted an experiment where the joint model has access to an LSA representation of the reference translation.

Table 7 shows that the joint approach has an oracle accuracy of up to 4.3 BLEU above the baseline. This clearly confirms that the joint approach can exploit the additional information to improve BLEU, given a good enough representation of the foreign sentence. In terms of perplexity, we see an improvement of up to 65% over the target-only model. It should be noted that since the LSA representations are computed on reference words, perplexity no longer has its standard meaning.

5.4 Target Language Projections

Our experiments so far showed that joint models based on direct representations of the source words are very competitive with the traditional channel models (§5.2). However, these experiments have not shown any improvements over the normal recurrent neural network language model. The previous section demonstrated that good representations can lead to substantial gains (§5.3). In order to bridge the gap, we propose to learn a separate transform from the foreign words to an encoding of the reference target words, thus making the source-side representations look more like the target-side encodings used in the oracle experiment.

Specifically, we learn a linear transform dθ : x → r mapping directly from a vector encoding of the foreign sentence x to an l-dimensional LSA representation r of the reference sentence. At test and training time we apply dθ to the foreign words and use the transformation instead of a direct source-side representation.

                   dev    news2010   news2011   newssyscomb2010   Avg(test)   PPL
Baseline           24.3   24.4       25.1       24.3              24.7        341
target-only        25.1   25.1       26.4       25.0              25.6        218
proj-lsa-dim40     25.1   25.3       26.5       25.2              25.8        145
proj-lsa-dim80     25.1   25.3       26.6       25.2              25.8        134

Table 8: Translation accuracy of the joint model with a source-target transform, measured on the French-English task. Perplexity (PPL) is based on news2011; differences to target-only are significant at the p < 0.001 level.

The transform models all foreign words in the parallel corpus except singletons, which are collapsed into a unique class, similar to the recurrent neural network language model. We train the transform to minimize the squared error with respect to the reference LSA vector using an SGD online learner:

    θ* = arg min_θ Σ_{i=1}^{n} ( r_i − dθ(x_i) )²        (1)

We found a simple constant learning rate, tuned on the validation data, to be as effective as schedules based on constant decay, or reducing the learning rate when the validation error increased. Our feature-set includes unigram and bigram word features. The value of unigram features is simply the unigram count in that sentence; bigram features receive a weight of the bigram count divided by two to help prevent overfitting. The vector for each sentence is then divided by its L2 norm. Both weighting and normalization led to substantial improvements in test set error. More complex features such as skip-bigrams, trigrams and character n-grams did not yield any significant improvements. Even this representation of sentences is composed of a large number of features, and so we resorted to feature hashing by computing feature ids as the least significant 20 bits of each feature name. Our best transform achieved a cosine similarity of 0.816 on the training data, 0.757 on the validation data, and 0.749 on news2011.

The results (Table 8) show that the transform improves over the recurrent neural network language model on all test sets, and by 0.2 BLEU on average. We verified significance over the target-only model using paired bootstrap resampling (Koehn, 2004) over all test sets (7,526 sentences) at the p < 0.001 level. Overall, we improve accuracy by up to 1.5 BLEU, and by 1.1 BLEU on average across all test sets, over the decoder 1-best with our joint language and translation model.
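The following is a minimal sketch of this projection step: hashed unigram and bigram count features for a source sentence, L2 normalization, and a linear map dθ trained by SGD to minimize the squared error of Equation (1) against the reference LSA vector. The hash function, the learning rate, the reduced number of hash bits, and the toy data are illustrative assumptions rather than the settings used for the reported results.

import numpy as np

HASH_BITS = 16                          # the paper uses the low 20 bits; smaller here to keep the toy light
DIM = 1 << HASH_BITS

def sentence_features(words):
    """Sparse hashed unigram/bigram count features, L2-normalized.
    Bigram counts are halved, as described in Sec. 5.4."""
    feats = {}
    for w in words:                                       # unigram counts
        i = hash(("uni", w)) & (DIM - 1)
        feats[i] = feats.get(i, 0.0) + 1.0
    for a, b in zip(words, words[1:]):                    # bigram counts / 2
        i = hash(("bi", a, b)) & (DIM - 1)
        feats[i] = feats.get(i, 0.0) + 0.5
    norm = np.sqrt(sum(v * v for v in feats.values()))
    return {i: v / norm for i, v in feats.items()} if norm > 0 else feats

def predict(theta, feats):
    """d_theta(x) for a sparse feature vector: weighted sum of columns of theta."""
    out = np.zeros(theta.shape[0])
    for i, v in feats.items():
        out += v * theta[:, i]
    return out

def train_projection(src_sentences, ref_lsa, l=40, lr=0.05, epochs=5, seed=0):
    """Learn the linear transform by SGD on the squared error of Equation (1)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.01, size=(l, DIM))
    for _ in range(epochs):
        for words, r in zip(src_sentences, ref_lsa):
            feats = sentence_features(words)
            err = predict(theta, feats) - r               # gradient of 0.5 * ||r - pred||^2
            for i, v in feats.items():                    # sparse rank-one update
                theta[:, i] -= lr * err * v
    return theta

# Toy usage: random stand-ins for the reference LSA vectors of three source sentences.
rng = np.random.default_rng(1)
src = [s.split() for s in ["le chat dort", "le chien dort", "la maison est grande"]]
refs = [rng.normal(size=40) for _ in src]
theta = train_projection(src, refs)
f_t = predict(theta, sentence_features(src[0]))           # used in place of a direct
print(f_t.shape)                                          # source-side representation: (40,)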
6 Related Work

Our approach of combining language and translation modeling is very much in line with recent work on n-gram-based translation models (Crego and Yvon, 2010), and more recently continuous space-based translation models (Le et al., 2012a; Gao et al., 2013). The joint model presented in this paper differs in a number of key aspects: we use a recurrent architecture representing an unbounded history of both source and target words, rather than a feed-forward style network. Feed-forward networks and n-gram models have a finite history which makes predictions independent of anything but a small history of words. Furthermore, we only model the target side, which is different to previous work modeling both sides.

We introduced a new algorithm to tackle lattice rescoring with an unbounded model. The automatic speech recognition community has previously addressed this issue by either approximating long-span language models via simpler but more tractable models (Deoras et al., 2011b), or by identifying confusable subsets of the lattice from which n-best lists are constructed and rescored (Deoras et al., 2011a). We extend their work by directly mapping a recurrent neural network model onto the structure of the lattice, rescoring all states instead of focusing only on subsets.

7 Conclusion and Future Work

Joint language and translation modeling with recurrent neural networks leads to substantial gains over the 1-best decoder output, raising accuracy by up to 1.5 BLEU and by 1.1 BLEU on average across several test sets. The joint approach also improves over the gains of the recurrent neural network language model, adding 0.2 BLEU on average across several test sets. Our models are competitive with the traditional channel models, outperforming them in a head-to-head comparison.

Furthermore, we tackled the issue of lattice rescoring with an unbounded recurrent model by means of a novel algorithm that keeps a beam of recurrent histories. Finally, we have shown that the recurrent neural network language model can significantly improve over n-gram baselines across a range of language pairs, even when the baselines were trained on 575 times more data.

In future work we plan to directly learn representations of the source side during training of the joint model. Thus, the model itself can decide which encoding is best for the task. We also plan to change the cross entropy objective to a BLEU-inspired objective in a discriminative training regime, which we hope to be more effective. We would also like to apply recent advances in tackling the vanishing gradient problem (Pascanu et al., 2013) using a regularization term to maintain the magnitude of the gradients during back propagation through time. Finally, we would like to integrate the recurrent model directly into first-pass decoding, a straightforward extension of lattice rescoring using the algorithm we developed.

Acknowledgments

We would like to thank Anthony Aue, Hany Hassan Awadalla, Jon Clark, Li Deng, Sauleh Eetemadi, Jianfeng Gao, Qin Gao, Xiaodong He, Will Lewis, Arul Menezes, and Kristina Toutanova for helpful discussions related to this work as well as for comments on previous drafts. We would also like to thank the anonymous reviewers for their comments.

References

Alexandre Allauzen, Hélène Bonneau-Maynard, Hai-Son Le, Aurélien Max, Guillaume Wisniewski, François Yvon, Gilles Adda, Josep Maria Crego, Adrien Lardilleux, Thomas Lavergne, and Artem Sokolov. 2011. LIMSI @ WMT11. In Proc. of WMT, pages 309–315, Edinburgh, Scotland, July. Association for Computational Linguistics.

Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. Deep Neural Network Language Models. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 20–28, Stroudsburg, PA, USA. Association for Computational Linguistics.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, Dec.

Josep Crego and François Yvon. 2010. Factored bilingual n-gram language models for statistical machine translation. Machine Translation, 24(2):159–175.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

Anoop Deoras, Tomáš Mikolov, and Kenneth Church. 2011a. A Fast Re-scoring Strategy to Capture Long-Distance Dependencies. In Proc. of EMNLP, pages 1116–1127, Stroudsburg, PA, USA, July. Association for Computational Linguistics.

Anoop Deoras, Tomáš Mikolov, Stefan Kombrink, M. Karafiát, and Sanjeev Khudanpur. 2011b. Variational Approximation of Long-Span Language Models for LVCSR. In Proc. of ICASSP, pages 5532–5535.

Ahmad Emami and Frederick Jelinek. 2005. A Neural Syntactic Language Model. Machine Learning, 60(1-3):195–227, September.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2013. Learning Semantic Representations for the Phrase Translation Model. Technical Report MSR-TR-2013-88, Microsoft Research, September.

Joshua Goodman. 2001. Classes for Fast Maximum Entropy Training. In Proc. of ICASSP.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proc. of HLT-NAACL, pages 127–133, Edmonton, Canada, May.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proc. of ACL Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, Jun.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proc. of EMNLP, pages 388–395, Barcelona, Spain, Jul.

Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012a. Continuous Space Translation Models with Neural Networks. In Proc. of HLT-NAACL, pages 39–48, Montréal, Canada. Association for Computational Linguistics.

Hai-Son Le, Thomas Lavergne, Alexandre Allauzen, Marianna Apidianaki, Li Gong, Aurélien Max, Artem Sokolov, Guillaume Wisniewski, and François Yvon. 2012b. LIMSI @ WMT12. In Proc. of WMT, pages 330–337, Montréal, Canada, June. Association for Computational Linguistics.

Wolfgang Macherey, Franz Josef Och, Ignacio Thayer, and Jakob Uszkoreit. 2008. Lattice-based Minimum Error Rate Training for Statistical Machine Translation. In Proc. of EMNLP, pages 725–734, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tomáš Mikolov and Geoffrey Zweig. 2012. Context Dependent Recurrent Neural Network Language Model. In Proc. of Spoken Language Technologies (SLT), pages 234–239, Dec.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent Neural Network based Language Model. In Proc. of INTERSPEECH, pages 1045–1048.

Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011a. Strategies for Training Large Scale Neural Network Language Models. In Proc. of ASRU, pages 196–201.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011b. Extensions of Recurrent Neural Network Language Model. In Proc. of ICASSP, pages 5528–5531.

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In Proc. of NAACL, pages 746–751, Stroudsburg, PA, USA, June. Association for Computational Linguistics.

Tomáš Mikolov. 2012. Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.
Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL, pages 160–167, Sapporo, Japan, July.

Razvan Pascanu, Tomáš Mikolov, and Yoshua Bengio. 2013. On the difficulty of training Recurrent Neural Networks. In Proc. of ICML, arXiv abs/1211.5063.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning Internal Representations by Error Propagation. In Symposium on Parallel and Distributed Processing.

Holger Schwenk, Anthony Rousseau, and Mohammed Attik. 2012. Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 11–19. Association for Computational Linguistics.

Martin Sundermeyer, Ilya Oparin, Jean-Luc Gauvain, Ben Freiberg, Ralf Schlüter, and Hermann Ney. 2013. Comparison of Feedforward and Recurrent Neural Network Language Models. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 8430–8434, Vancouver, Canada, May.

Geoff Zweig and Konstantin Makarychev. 2013. Speed Regularization and Optimality in Word Classing. In Proc. of ICASSP.