Joint Language and Translation Modeling with Recurrent Neural Networks

Michael Auli, Michel Galley, Chris Quirk, Geoffrey Zweig
Microsoft Research, Redmond, WA, USA
{michael.auli,mgalley,chrisq,gzweig}@microsoft.com

Abstract

We present a joint language and translation model based on a recurrent neural network which predicts target words based on an unbounded history of both source and target words. The weaker independence assumptions of this model result in a vastly larger search space compared to related feed-forward-based language or translation models. We tackle this issue with a new lattice rescoring algorithm and demonstrate its effectiveness empirically. Our joint model builds on a well known recurrent neural network language model (Mikolov, 2012) augmented by a layer of additional inputs from the source language. We show competitive accuracy compared to the traditional channel model features. Our best results improve the output of a system trained on WMT 2012 French-English data by up to 1.5 BLEU, and by 1.1 BLEU on average across several test sets.

1 Introduction

Recently, several feed-forward neural network-based language and translation models have achieved impressive accuracy improvements on statistical machine translation tasks (Allauzen et al., 2011; Le et al., 2012b; Schwenk et al., 2012). In this paper we focus on recurrent neural network architectures, which have recently advanced the state of the art in language modeling (Mikolov et al., 2010; Mikolov et al., 2011a; Mikolov, 2012), outperforming multi-layer feed-forward networks in both perplexity and word error rate in speech recognition (Arisoy et al., 2012; Sundermeyer et al., 2013). The major attraction of recurrent architectures is their potential to capture long-span dependencies, since predictions are based on an unbounded history of previous words. This is in contrast to feed-forward networks as well as conventional n-gram models, both of which are limited to fixed-length contexts.

Building on the success of recurrent architectures, we base our joint language and translation model on an extension of the recurrent neural network language model (Mikolov and Zweig, 2012) that introduces a layer of additional inputs (§2).

Most previous work on neural networks for language modeling or translation used a rescoring setup based on n-best lists (Arisoy et al., 2012; Mikolov, 2012) for evaluation, thereby side-stepping the algorithmic and engineering challenges of direct decoder integration.¹ Instead, we exploit lattices, which offer a much richer representation of the decoder output, since they compactly encode an exponential number of translation hypotheses in polynomial space. In contrast, n-best lists are typically very redundant, representing only a few combinations of top scoring arcs in the lattice. A major challenge in lattice rescoring with a recurrent neural network model is the effect of the unbounded history on search, since the usual dynamic programming assumptions which are exploited for efficiency no longer hold. We apply a novel algorithm to the task of rescoring with an unbounded language model and empirically demonstrate its effectiveness (§3).

The algorithm proves robust, leading to significant improvements with the recurrent neural network language model over a competitive n-gram baseline across several language pairs. We even observe consistent gains when pairing the model with a large n-gram model trained on up to 575 times more data, demonstrating that the model provides complementary information (§4).

¹One notable exception is Le et al. (2012a) who rescore reordering lattices with a feed-forward network-based model.
Our joint modeling approach is based on adding a continuous space representation of the foreign sentence as an additional input to the recurrent neural network language model. With this extension, the language model can measure the consistency between the source and target words in a context-sensitive way. The model effectively combines the functionality of both the traditional channel and language model features. We test the power of this new model by using it as the only source of traditional channel information. Overall, we find that the model achieves accuracy competitive with the older channel model features and that it can improve over the gains observed with the recurrent neural network language model (§5).

Figure 1: Structure of the recurrent neural network model, including the auxiliary input layer ft.

2 Model Structure

We base our model on the recurrent neural network language model of Mikolov et al. (2010) which is factored into an input layer, a hidden layer with recurrent connections, and an output layer (Figure 1). The input layer encodes the target language word at time t as a 1-of-N vector e_t, where |V| is the size of the vocabulary, and the output layer y_t represents a probability distribution over target words; both are of size |V|. The hidden layer state h_t encodes the history of all words observed in the sequence up to time step t. This model is extended by an auxiliary input layer f_t which provides complementary information to the input layer (Mikolov and Zweig, 2012). While the auxiliary input layer can be used to feed in arbitrary additional information, we focus on encodings of the foreign sentence (§5).

The state of the hidden layer is determined by the input layer, the auxiliary input layer and the hidden layer configuration of the previous time step h_{t−1}. The weights of the connections between the layers are summarized in a number of matrices: U, F and W represent weights from the input layer to the hidden layer, from the auxiliary input layer to the hidden layer, and from the previous hidden layer to the current hidden layer, respectively. Matrix V represents connections between the current hidden layer and the output layer; G represents direct weights between the auxiliary input and output layers. The hidden and output layers are computed via a series of matrix-vector products and non-linearities:

    h_t = s(U e_t + W h_{t−1} + F f_t)
    y_t = g(V h_t + G f_t)

where

    s(z) = 1 / (1 + exp{−z}),    g(z_m) = exp{z_m} / Σ_k exp{z_k}

are sigmoid and softmax functions, respectively.

Additionally, the network is interpolated with a maximum entropy model of sparse n-gram features over input words (Mikolov et al., 2011a).² The maximum entropy weights are added to the output activations before computing the softmax.

The model is optimized via a maximum likelihood objective function using stochastic gradient descent. Training is based on the back propagation through time algorithm, which unrolls the network and then computes error gradients over multiple time steps (Rumelhart et al., 1986). After training, the output layer represents posteriors p(e_{t+1} | e^t_{t−n+1}, h_t, f_t), the probability distribution over words in the output vocabulary given the n previous input words e^t_{t−n+1}, the hidden layer configuration h_t, as well as the auxiliary input layer configuration f_t.

²While these features depend on multiple input words, we depicted them for simplicity as a connection between the current input word vector e_t and the output layer (D).

Naïve computation of the probability distribution over the next word is very expensive for large vocabularies. A well established efficiency trick uses word-classing to create a more efficient two-step process (Goodman, 2001; Emami and Jelinek, 2005; Mikolov et al., 2011b) where each word is assigned a unique class. To compute the probability of a word, we first compute the probability of its class, and then multiply it by the probability of the word conditioned on the class:

    p(e_{t+1} | e^t_{t−n+1}, h_t, f_t) = p(c_i | e^t_{t−n+1}, h_t, f_t) × p(e_{t+1} | c_i, e^t_{t−n+1}, h_t, f_t)

This factorization reduces the complexity of computing the output probabilities from O(|V|) to O(|C| + max_i |c_i|), where |C| is the number of classes and |c_i| the number of words in class c_i.

When rescoring with an n-gram model, we have to maintain at least two words at each state, also known as the n-gram context.

1:  function RESCORELATTICE(k, V, E, s, T)
2:    Q ← TOPOLOGICALLY-SORT(V)
3:    for all v in V do                      ▷ Heaps of split-states
4:      H_v ← MINHEAP()
5:    end for
6:    h_0 ← 0                                ▷ Initialize start-state (zero vector)
7:    H_s.ADD(h_0)
8:    for all v in Q do                      ▷ Examine outgoing arcs
9:      for ⟨v, x⟩ in E do
10:       for h in H_v do                    ▷ Extend LM states
11:         h′ ← SCORERNN(h, phrase(h))
12:         parent(h′) ← h                   ▷ Backpointers
13:         if H_x.size() ≥ k ∧              ▷ Beam width
14:           H_x.MIN()
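To make the model equations and the state extension used in the listing above concrete, the following is a minimal NumPy sketch of one forward step with the auxiliary input layer and the class-factored output. The function rnn_step corresponds loosely to the state update a routine such as SCORERNN would perform; all dimensions, the class assignment, the random initialization, and the omission of the maximum entropy features are illustrative assumptions, not the implementation used in the experiments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(params, w_t, h_prev, f_t):
    """One step of the recurrent LM with auxiliary input f_t.
    Returns the new hidden state h_t = s(U e_t + W h_{t-1} + F f_t);
    e_t is 1-of-N, so U e_t reduces to a column lookup."""
    U, W, F = params["U"], params["W"], params["F"]
    return sigmoid(U[:, w_t] + W @ h_prev + F @ f_t)

def word_prob(params, classes, word2class, w_next, h_t, f_t):
    """Class-factored output: p(w) = p(class(w) | history) * p(w | class(w), history)."""
    Vw, Gw = params["Vw"], params["Gw"]      # word output weights
    Vc, Gc = params["Vc"], params["Gc"]      # class output weights
    c = word2class[w_next]
    members = classes[c]                     # word ids belonging to class c
    p_class = softmax(Vc @ h_t + Gc @ f_t)[c]
    in_class = softmax(Vw[members] @ h_t + Gw[members] @ f_t)
    return p_class * in_class[members.index(w_next)]

# Toy usage with random weights (all sizes are arbitrary choices).
rng = np.random.default_rng(0)
V_size, H, D, C = 10, 8, 4, 2                # vocab, hidden, auxiliary, classes
params = {
    "U": rng.normal(size=(H, V_size)), "W": rng.normal(size=(H, H)),
    "F": rng.normal(size=(H, D)),
    "Vw": rng.normal(size=(V_size, H)), "Gw": rng.normal(size=(V_size, D)),
    "Vc": rng.normal(size=(C, H)),      "Gc": rng.normal(size=(C, D)),
}
word2class = [0] * 5 + [1] * 5               # first half of the vocabulary in class 0
classes = {0: list(range(5)), 1: list(range(5, 10))}
h = np.zeros(H)
f = rng.normal(size=D)                       # stand-in for the source-sentence encoding
for w, w_next in [(1, 4), (4, 7)]:
    h = rnn_step(params, w, h, f)
    print(word_prob(params, classes, word2class, w_next, h, f))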

Table 2: French-English results when rescoring with the recurrent neural network language model; the baseline relies on an n-gram model trained on 1.15bn words.

                dev    news2010   news2011   newssyscomb2011   Avg(test)
Baseline        21.2   20.7       19.2       20.6              20.0
+RNNLM (2m)     21.8   20.9       19.4       20.9              20.3
+RNNLM (50m)    22.1   21.1       19.7       21.0              20.5

Table 3: German-English results when rescoring with the recurrent neural network language model.

                dev    news2010   news2011   newssyscomb2011   Avg(test)
Baseline        15.2   15.6       14.3       15.7              15.1
+RNNLM (2m)     15.7   15.9       14.6       16.0              15.4
+RNNLM (50m)    15.8   15.9       14.7       16.1              15.5

Table 4: English-German results when rescoring with the recurrent neural network language model; the baseline relies on an n-gram model trained on 327m words.

Finally, we train a transform between the source words and the reference representations. This leads to the best results, improving 1.5 BLEU over the 1-best decoder output and adding 0.2 BLEU on average to the gains achieved by the recurrent language model (§5.4).

Setup. Conventional language models can be trained on monolingual or bilingual data; however, the joint model can only be trained on the latter. In order to control for data size effects, we restrict training of all models, including the baseline n-gram model, to the target side of the parallel corpus, about 102m words for French-English. Furthermore, we train recurrent models only on the newswire portion (about 3.5m words for training and 250k words for validation), since initial experiments showed comparable results to using the full parallel corpus, which is available to the baseline. This is reasonable since the test data is newswire. Also, it allows for more rapid experimentation.

5.1 Foreign Sentence Representations

We represent foreign sentences either by latent semantic analysis (LSA; Deerwester et al. 1990) or by word encodings produced as a by-product of training the recurrent neural network language model on the source words.

LSA is widely used for representing words and documents in low-dimensional vector space. The method applies reduced singular value decomposition (SVD) to a matrix M of word counts; in our setting, rows represent sentences and columns represent foreign words. SVD reduces the number of columns while preserving similarity among the rows, effectively mapping from a high-dimensional representation of a sentence, as a set of words, to a low-dimensional set of concepts. The output of SVD is an approximation of M by three matrices: T contains single word representations, R represents full sentences, and S is a diagonal scaling matrix:

    M ≈ T S Rᵀ

Given vocabulary V and n sentences, we construct M as a matrix of size |V| × n. The ij-th entry is the number of times word i occurs in sentence j, also known as the term frequency value; the entry is also weighted by the inverse document frequency, the relative importance of word i among all sentences, expressed as the negative logarithm of the fraction of sentences in which word i occurs.

As a second representation, we use single word embeddings implicitly learned by the input layer weights U of the recurrent neural network language model (§2), denoted as RNN. Each word is represented by a vector of size |h_t|, the number of neurons in the hidden layer; in our experiments, we consider concatenations of individual word vectors to represent foreign word contexts. These encodings have previously been found to capture syntactic and semantic regularities (Mikolov et al., 2013) and are readily available in our experimental framework via training a recurrent neural network language model on the source side of the parallel corpus.
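As an illustration of this construction, the following is a minimal sketch of the LSA sentence encoding described above: a |V| × n term frequency matrix weighted by inverse document frequency, followed by a truncated SVD. The toy corpus, the dimensionality k, and the exact weighting details are illustrative assumptions rather than the preprocessing used for the experiments.

import numpy as np

def lsa_sentence_vectors(sentences, k=2):
    """Build a |V| x n tf-idf matrix M (rows: words, columns: sentences) and
    reduce it with a truncated SVD, M ~= T S R^T, as described in Sec. 5.1.
    Returns word vectors T (|V| x k), singular values S, sentence vectors R (n x k)."""
    vocab = sorted({w for s in sentences for w in s})
    word_id = {w: i for i, w in enumerate(vocab)}
    n = len(sentences)

    # Term frequencies.
    M = np.zeros((len(vocab), n))
    for j, s in enumerate(sentences):
        for w in s:
            M[word_id[w], j] += 1.0

    # Inverse document frequency: -log(fraction of sentences containing word i).
    df = (M > 0).sum(axis=1)
    idf = -np.log(df / n)
    # Guard against words occurring in every sentence (idf would be exactly 0).
    M = M * np.maximum(idf, 1e-6)[:, None]

    # Reduced SVD keeping the top-k singular values.
    T_full, s_full, Rt_full = np.linalg.svd(M, full_matrices=False)
    return T_full[:, :k], s_full[:k], Rt_full[:k, :].T, word_id

# Toy usage on whitespace-tokenized "foreign" sentences (illustrative only).
corpus = [s.split() for s in ["le chat dort", "le chien dort", "la maison est grande"]]
T, S, R, word_id = lsa_sentence_vectors(corpus, k=2)
print(R.shape)        # (3, 2): one 2-dimensional encoding per sentence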
                        −p(e|f)   −p(e|f), −p(f|e)
Baseline without CM     24.0      22.5
+ target-only           24.5      22.6
+ sent-lsa-dim240       24.9      23.3
+ ww-rnn-dim50.n5       24.9      24.0
+ ww-rnn-dim50.p5n5     24.6      23.7
+ ww-rnn-dim50.p3       24.6      22.3
+ ww-rnn-dim50.c5       24.9      24.0
+ ww-lsa-dim50.n5       24.8      23.9
+ ww-lsa-dim50.p3       23.8      23.2

Table 6: Comparison of the joint model and the channel model features (CM) by removing channel features corresponding to −p(e|f) from the lattices, or both directions −p(e|f) and −p(f|e), and replacing them by various joint models. We re-tuned the log-linear weights for the different feature-sets. Accuracy is based on the average BLEU over news2010, newssyscomb2010, and news2011.

5.2 Results

We first experiment with the two previously introduced representations of the source-side sentence. Table 5 shows the results compared to the 1-best decoder output and an RNN language model (target-only). We first try LSA encodings of the entire foreign sentence as 80 or 240 dimensional vectors (sent-lsa-dim80, sent-lsa-dim240). Next, we experiment with single-word RNN representations of sliding word-windows in the hope of representing relevant context more precisely. Word-windows are constructed relative to the source words aligned to the current target word, and individual word vectors are concatenated into a single vector. We first try contexts which do not include the aligned source words, in the hope of capturing information not already modeled by the channel models, starting with the next five words (ww-rnn-dim50.n5), the five previous and the next five words (ww-rnn-dim50.p5n5), as well as the previous three words (ww-rnn-dim50.p3). Next, we experiment with word-windows of up to five aligned source words (ww-rnn-dim50.c5). Finally, we try contexts based on LSA word vectors (ww-lsa-dim50.n5, ww-lsa-dim50.p3).⁵

While all models improve over the baseline, none significantly outperforms the recurrent neural network language model in terms of BLEU. However, the perplexity results suggest that the models utilize the foreign representations, since all joint models improve vastly over the target-only language model. The lowest perplexity is achieved by the context covering the aligned source words (ww-rnn-dim50.c5), since the source words are a better predictor of the target words than outside context.

The experiments so far measured if the joint model can improve in addition to the four channel model features used by the baseline, that is, the maximum likelihood and lexical translation features in both translation directions. The joint model clearly overlaps with these features, but how well does the recurrent model perform compared against the channel model features? To answer this question, we removed the channel model features corresponding to the same translation direction as the joint model, specifically pMLE(e|f) and pLW(e|f), from the lattices and measured the effect of adding the joint models.

The results (Table 6, column −p(e|f)) clearly show that our joint models are competitive with the channel model features, outperforming the original baseline with all channel model features (24.7 BLEU) by 0.2 BLEU (ww-rnn-dim50.n5, ww-rnn-dim50.c5). As a second experiment, we removed all channel model features (column −p(e|f), −p(f|e)), diminishing baseline accuracy to 22.5 BLEU. In this setting, the best joint model is able to make up 1.5 of the 2.2 BLEU lost due to the removal of the channel model features, while modeling only a single translation direction. This setup also shows the negligible effect of the target-only language model in the absence of translation scores, whereas the joint models are much more effective since they do model translation. Overall, the best joint models prove very competitive to the traditional channel features.

⁵We ignore the coverage vector when determining word-windows, which risks including already translated words. Building word-windows based on the coverage vector would require additional state in a rescoring setting meant to be light-weight.
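For concreteness, the following is a minimal sketch, under assumed data structures, of how such a word-window encoding could be assembled: given the source position aligned to the current target word, it concatenates the embeddings of the surrounding source words (here the next five, mirroring the .n5 setting) and zero-pads at sentence boundaries. The embedding lookup and padding scheme are illustrative assumptions, not the exact implementation used in the experiments.

import numpy as np

def window_encoding(src_embeddings, aligned_pos, before=0, after=5):
    """Concatenate embeddings of source words around the aligned position.

    src_embeddings: list of d-dimensional vectors, one per source word (e.g. columns
                    of the input weight matrix U of a source-side RNN LM, or LSA
                    word vectors).
    aligned_pos:    index of the source word aligned to the current target word.
    before, after:  words to take to the left / right of the aligned word;
                    before=0, after=5 corresponds to the ".n5" windows in Sec. 5.2.
    Returns a single vector of size d * (before + after), zero-padded at the edges.
    """
    d = len(src_embeddings[0])
    pieces = []
    for offset in list(range(-before, 0)) + list(range(1, after + 1)):
        i = aligned_pos + offset
        if 0 <= i < len(src_embeddings):
            pieces.append(src_embeddings[i])
        else:
            pieces.append(np.zeros(d))      # pad positions outside the sentence
    return np.concatenate(pieces)

# Toy usage: a 7-word source sentence with random 50-dimensional embeddings.
rng = np.random.default_rng(1)
src = [rng.normal(size=50) for _ in range(7)]
f_t = window_encoding(src, aligned_pos=2, before=0, after=5)
print(f_t.shape)    # (250,): five concatenated 50-dim vectors, as in ww-rnn-dim50.n5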
                      dev    news2010   news2011   newssyscomb2010   Avg(test)   PPL
Baseline              24.3   24.4       25.1       24.3              24.7        341
target-only           25.1   25.1       26.4       25.0              25.6        218
sent-lsa-dim80        25.2   25.2       26.3       25.1              25.6        147
sent-lsa-dim240       25.1   25.0       26.2       24.9              25.4        126
ww-rnn-dim50.n5       24.9   25.0       26.3       24.8              25.4         61
ww-rnn-dim50.p5n5     25.0   24.8       26.2       24.7              25.3         59
ww-rnn-dim50.p3       25.1   25.1       26.5       24.9              25.6        143
ww-rnn-dim50.c5       24.8   24.9       26.0       24.8              25.3         16
ww-lsa-dim50.n5       25.0   25.0       26.2       24.8              25.4         76
ww-lsa-dim50.p3       25.1   25.1       26.5       24.9              25.6        151

Table 5: Translation accuracy of the joint model with various encodings of the foreign sentence measured on the French-English task. Perplexity (PPL) is based on news2011.

                           BLEU   PPL
Baseline                   25.2   341
target-only                26.4   218
oracle (sent-lsa-dim40)    27.7   124
oracle (sent-lsa-dim80)    28.5   103
oracle (sent-lsa-dim160)   29.0    86
oracle (sent-lsa-dim240)   29.5    76

Table 7: Oracle accuracy of the joint model when using an LSA encoding of the references, measured on the news2011 French-English task.

5.3 Oracle Experiment

The previous section examined the effect of a set of basic foreign sentence representations. Although we find some benefit from these representations, the differences are not large. One might naturally ask whether there is greater potential upside from this channel model. Therefore we turn to measuring the upper bound on accuracy for the joint approach as a whole.

Specifically, we would like to find a bound on accuracy given an ideal representation of the source sentence. To answer this question, we conducted an experiment where the joint model has access to an LSA representation of the reference translation.

Table 7 shows that the joint approach has an oracle accuracy of up to 4.3 BLEU above the baseline. This clearly confirms that the joint approach can exploit the additional information to improve BLEU, given a good enough representation of the foreign sentence. In terms of perplexity, we see an improvement of up to 65% over the target-only model. It should be noted that since the LSA representations are computed on reference words, perplexity no longer has its standard meaning.

5.4 Target Language Projections

Our experiments so far showed that joint models based on direct representations of the source words are very competitive with the traditional channel models (§5.2). However, these experiments have not shown any improvements over the normal recurrent neural network language model. The previous section demonstrated that good representations can lead to substantial gains (§5.3). In order to bridge the gap, we propose to learn a separate transform from the foreign words to an encoding of the reference target words, thus making the source-side representations look more like the target-side encodings used in the oracle experiment.

Specifically, we learn a linear transform dθ : x → r mapping directly from a vector encoding of the foreign sentence x to an l-dimensional LSA representation r of the reference sentence. At test and training time we apply dθ to the foreign words and use the transformation instead of a direct source-side representation.

                   dev    news2010   news2011   newssyscomb2010   Avg(test)   PPL
Baseline           24.3   24.4       25.1       24.3              24.7        341
target-only        25.1   25.1       26.4       25.0              25.6        218
proj-lsa-dim40     25.1   25.3       26.5       25.2              25.8        145
proj-lsa-dim80     25.1   25.3       26.6       25.2              25.8        134

Table 8: Translation accuracy of the joint model with a source-target transform, measured on the French-English task. Perplexity (PPL) is based on news2011; differences to target-only are significant at the p < 0.001 level.

The transform models all foreign words in the parallel corpus except singletons, which are collapsed into a unique class, similar to the recurrent neural network language model. We train the transform to minimize the squared error with respect to the reference LSA vector using an SGD online learner:

    θ* = arg min_θ Σ_{i=1}^{n} ( r_i − dθ(x_i) )²        (1)

We found a simple constant learning rate, tuned on the validation data, to be as effective as schedules based on constant decay, or reducing the learning rate when the validation error increased. Our feature-set includes unigram and bigram word features. The value of unigram features is simply the unigram count in that sentence; bigram features receive a weight of the bigram count divided by two to help prevent overfitting. The vector for each sentence is then divided by its L2 norm. Both weighting and normalization led to substantial improvements in test set error. More complex features such as skip-bigrams, trigrams and character n-grams did not yield any significant improvements. Even this representation of sentences is composed of a large number of features, and so we resorted to feature hashing by computing feature ids as the least significant 20 bits of each feature name. Our best transform achieved a cosine similarity of 0.816 on the training data, 0.757 on the validation data, and 0.749 on news2011.

The results (Table 8) show that the transform improves over the recurrent neural network language model on all test sets, and by 0.2 BLEU on average. We verified significance over the target-only model using paired bootstrap resampling (Koehn, 2004) over all test sets (7,526 sentences) at the p < 0.001 level. Overall, we improve accuracy by up to 1.5 BLEU, and by 1.1 BLEU on average across all test sets, over the decoder 1-best with our joint language and translation model.
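The following is a minimal sketch of this projection step: hashed unigram and bigram count features for a source sentence, L2 normalization, and a linear map dθ trained by SGD to minimize the squared error of Equation (1) against the reference LSA vector. The hash function, the learning rate, the reduced number of hash bits, and the toy data are illustrative assumptions rather than the settings used for the reported results.

import numpy as np

HASH_BITS = 16                          # the paper uses the low 20 bits; smaller here to keep the toy light
DIM = 1 << HASH_BITS

def sentence_features(words):
    """Sparse hashed unigram/bigram count features, L2-normalized.
    Bigram counts are halved, as described in Sec. 5.4."""
    feats = {}
    for w in words:                                       # unigram counts
        i = hash(("uni", w)) & (DIM - 1)
        feats[i] = feats.get(i, 0.0) + 1.0
    for a, b in zip(words, words[1:]):                    # bigram counts / 2
        i = hash(("bi", a, b)) & (DIM - 1)
        feats[i] = feats.get(i, 0.0) + 0.5
    norm = np.sqrt(sum(v * v for v in feats.values()))
    return {i: v / norm for i, v in feats.items()} if norm > 0 else feats

def predict(theta, feats):
    """d_theta(x) for a sparse feature vector: weighted sum of columns of theta."""
    out = np.zeros(theta.shape[0])
    for i, v in feats.items():
        out += v * theta[:, i]
    return out

def train_projection(src_sentences, ref_lsa, l=40, lr=0.05, epochs=5, seed=0):
    """Learn the linear transform by SGD on the squared error of Equation (1)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.01, size=(l, DIM))
    for _ in range(epochs):
        for words, r in zip(src_sentences, ref_lsa):
            feats = sentence_features(words)
            err = predict(theta, feats) - r               # gradient of 0.5 * ||r - pred||^2
            for i, v in feats.items():                    # sparse rank-one update
                theta[:, i] -= lr * err * v
    return theta

# Toy usage: random stand-ins for the reference LSA vectors of three source sentences.
rng = np.random.default_rng(1)
src = [s.split() for s in ["le chat dort", "le chien dort", "la maison est grande"]]
refs = [rng.normal(size=40) for _ in src]
theta = train_projection(src, refs)
f_t = predict(theta, sentence_features(src[0]))           # used in place of a direct
print(f_t.shape)                                          # source-side representation: (40,)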
6 Related Work

Our approach of combining language and translation modeling is very much in line with recent work on n-gram-based translation models (Crego and Yvon, 2010), and more recently continuous space-based translation models (Le et al., 2012a; Gao et al., 2013). The joint model presented in this paper differs in a number of key aspects: we use a recurrent architecture representing an unbounded history of both source and target words, rather than a feed-forward style network. Feed-forward networks and n-gram models have a finite history which makes predictions independent of anything but a small history of words. Furthermore, we only model the target side, which is different to previous work modeling both sides.

We introduced a new algorithm to tackle lattice rescoring with an unbounded model. The automatic speech recognition community has previously addressed this issue by either approximating long-span language models via simpler but more tractable models (Deoras et al., 2011b), or by identifying confusable subsets of the lattice from which n-best lists are constructed and rescored (Deoras et al., 2011a). We extend their work by directly mapping a recurrent neural network model onto the structure of the lattice, rescoring all states instead of focusing only on subsets.

7 Conclusion and Future Work

Joint language and translation modeling with recurrent neural networks leads to substantial gains over the 1-best decoder output, raising accuracy by up to 1.5 BLEU and by 1.1 BLEU on average across several test sets. The joint approach also improves over the gains of the recurrent neural network language model, adding 0.2 BLEU on average across several test sets. Our models are competitive with the traditional channel models, outperforming them in a head-to-head comparison.

Furthermore, we tackled the issue of lattice rescoring with an unbounded recurrent model by means of a novel algorithm that keeps a beam of recurrent histories. Finally, we have shown that the recurrent neural network language model can significantly improve over n-gram baselines across a range of language pairs, even when the baselines were trained on 575 times more data.

In future work we plan to directly learn representations of the source side during training of the joint model. Thus, the model itself can decide which encoding is best for the task. We also plan to change the cross entropy objective to a BLEU-inspired objective in a discriminative training regime, which we hope to be more effective. We would also like to apply recent advances in tackling the vanishing gradient problem (Pascanu et al., 2013) using a regularization term to maintain the magnitude of the gradients during back propagation through time. Finally, we would like to integrate the recurrent model directly into first-pass decoding, a straightforward extension of lattice rescoring using the algorithm we developed.

Acknowledgments

We would like to thank Anthony Aue, Hany Hassan Awadalla, Jon Clark, Li Deng, Sauleh Eetemadi, Jianfeng Gao, Qin Gao, Xiaodong He, Will Lewis, Arul Menezes, and Kristina Toutanova for helpful discussions related to this work as well as for comments on previous drafts. We would also like to thank the anonymous reviewers for their comments.

References

Alexandre Allauzen, Hélène Bonneau-Maynard, Hai-Son Le, Aurélien Max, Guillaume Wisniewski, François Yvon, Gilles Adda, Josep Maria Crego, Adrien Lardilleux, Thomas Lavergne, and Artem Sokolov. 2011. LIMSI @ WMT11. In Proc. of WMT, pages 309–315, Edinburgh, Scotland, July. Association for Computational Linguistics.

Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. Deep Neural Network Language Models. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 20–28, Stroudsburg, PA, USA. Association for Computational Linguistics.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, Dec.

Josep Crego and François Yvon. 2010. Factored bilingual n-gram language models for statistical machine translation. Machine Translation, 24(2):159–175.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

Anoop Deoras, Tomáš Mikolov, and Kenneth Church. 2011a. A Fast Re-scoring Strategy to Capture Long-Distance Dependencies. In Proc. of EMNLP, pages 1116–1127, Stroudsburg, PA, USA, July. Association for Computational Linguistics.

Anoop Deoras, Tomáš Mikolov, Stefan Kombrink, M. Karafiát, and Sanjeev Khudanpur. 2011b. Variational Approximation of Long-Span Language Models for LVCSR. In Proc. of ICASSP, pages 5532–5535.

Ahmad Emami and Frederick Jelinek. 2005. A Neural Syntactic Language Model. Machine Learning, 60(1-3):195–227, September.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2013. Learning Semantic Representations for the Phrase Translation Model. Technical Report MSR-TR-2013-88, Microsoft Research, September.

Joshua Goodman. 2001. Classes for Fast Maximum Entropy Training. In Proc. of ICASSP.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proc. of HLT-NAACL, pages 127–133, Edmonton, Canada, May.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proc. of ACL Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, Jun.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proc. of EMNLP, pages 388–395, Barcelona, Spain, Jul.

Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012a. Continuous Space Translation Models with Neural Networks. In Proc. of HLT-NAACL, pages 39–48, Montréal, Canada. Association for Computational Linguistics.

Hai-Son Le, Thomas Lavergne, Alexandre Allauzen, Marianna Apidianaki, Li Gong, Aurélien Max, Artem Sokolov, Guillaume Wisniewski, and François Yvon. 2012b. LIMSI @ WMT12. In Proc. of WMT, pages 330–337, Montréal, Canada, June. Association for Computational Linguistics.

Wolfgang Macherey, Franz Josef Och, Ignacio Thayer, and Jakob Uszkoreit. 2008. Lattice-based Minimum Error Rate Training for Statistical Machine Translation. In Proc. of EMNLP, pages 725–734, Stroudsburg, PA, USA. Association for Computational Linguistics.

Tomáš Mikolov and Geoffrey Zweig. 2012. Context Dependent Recurrent Neural Network Language Model. In Proc. of Spoken Language Technologies (SLT), pages 234–239, Dec.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent Neural Network based Language Model. In Proc. of INTERSPEECH, pages 1045–1048.

Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011a. Strategies for Training Large Scale Neural Network Language Models. In Proc. of ASRU, pages 196–201.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011b. Extensions of Recurrent Neural Network Language Model. In Proc. of ICASSP, pages 5528–5531.

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. In Proc. of NAACL, pages 746–751, Stroudsburg, PA, USA, June. Association for Computational Linguistics.

Tomáš Mikolov. 2012. Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.
Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL, pages 160–167, Sapporo, Japan, July.

Razvan Pascanu, Tomáš Mikolov, and Yoshua Bengio. 2013. On the difficulty of training Recurrent Neural Networks. In Proc. of ICML, arXiv abs/1211.5063.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning Internal Representations by Error Propagation. In Symposium on Parallel and Distributed Processing.

Holger Schwenk, Anthony Rousseau, and Mohammed Attik. 2012. Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation. In NAACL-HLT Workshop on the Future of Language Modeling for HLT, pages 11–19. Association for Computational Linguistics.

Martin Sundermeyer, Ilya Oparin, Jean-Luc Gauvain, Ben Freiberg, Ralf Schlüter, and Hermann Ney. 2013. Comparison of Feedforward and Recurrent Neural Network Language Models. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 8430–8434, Vancouver, Canada, May.

Geoff Zweig and Konstantin Makarychev. 2013. Speed Regularization and Optimality in Word Classing. In Proc. of ICASSP.