Using the Output Embedding to Improve Language Models

Ofir Press and Lior Wolf
School of Computer Science, Tel-Aviv University, Israel
{ofir.press,wolf}@cs.tau.ac.il

Abstract

We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model. We also offer a new method of regularizing the output embedding. Our methods lead to a significant reduction in perplexity, as we are able to show on a variety of neural network language models. Finally, we show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.

1 Introduction

In a common family of neural network language models, the current input word is represented as the vector c ∈ IR^C and is projected to a dense representation using a word embedding matrix U. Some computation is then performed on the word embedding U^T c, which results in a vector of activations h2. A second matrix V then projects h2 to a vector h3 containing one score per vocabulary word: h3 = V h2. The vector of scores is then converted to a vector of probability values p, which represents the model's prediction of the next word, using the softmax function.

For example, in the LSTM-based language models of (Sundermeyer et al., 2012; Zaremba et al., 2014), for a vocabulary of size C, the one-hot encoding is used to represent the input c and U ∈ IR^{C×H}. An LSTM is then employed, which results in an activation vector h2 that, similarly to U^T c, is also in IR^H. In this case, U and V are of exactly the same size.

We call U the input embedding, and V the output embedding. In both matrices, we expect rows that correspond to similar words to be similar: for the input embedding, we would like the network to react similarly to synonyms, while in the output embedding, we would like the scores of words that are interchangeable to be similar (Mnih and Teh, 2012).

While U and V can both serve as word embeddings, in the literature, only the former serves this role. In this paper, we compare the quality of the input embedding to that of the output embedding, and we show that the latter can be used to improve neural network language models. Our main results are as follows: (i) We show that in the word2vec skip-gram model, the output embedding is only slightly inferior to the input embedding. This is shown using metrics that are commonly used in order to measure embedding quality. (ii) In recurrent neural network based language models, the output embedding outperforms the input embedding. (iii) By tying the two embeddings together, i.e., enforcing U = V, the joint embedding evolves in a more similar way to the output embedding than to the input embedding of the untied model. (iv) Tying the input and output embeddings leads to an improvement in the perplexity of various language models. This is true both when using dropout and when not using it. (v) When not using dropout, we propose adding an additional projection P before V, and apply regularization to P. (vi) Weight tying in neural translation models can reduce their size (number of parameters) to less than half of their original size without harming their performance.
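To make the setup above concrete, the following PyTorch sketch builds a two-layer LSTM language model with an input embedding U, an output embedding V, and an optional flag that enforces U = V. It is a minimal illustration under our own naming and sizing assumptions, not any implementation referenced in this paper.

```python
# Minimal sketch (PyTorch; illustrative names and sizes) of an LSTM language
# model with an input embedding U, an output embedding V, and optional tying.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, hidden_size, tie_weights=False):
        super().__init__()
        # Input embedding U: one H-dimensional row per vocabulary word.
        self.input_embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=2, batch_first=True)
        # Output embedding V: projects h2 to one score per vocabulary word.
        self.output_embedding = nn.Linear(hidden_size, vocab_size, bias=False)
        if tie_weights:
            # Weight tying: U and V become a single shared matrix.
            self.output_embedding.weight = self.input_embedding.weight

    def forward(self, word_ids):
        h2, _ = self.lstm(self.input_embedding(word_ids))  # activations h2
        h3 = self.output_embedding(h2)                      # scores h3 = V h2
        return torch.log_softmax(h3, dim=-1)                # log-probabilities p

# Example: a tied model on a toy vocabulary of 10,000 words.
model = LSTMLanguageModel(vocab_size=10000, hidden_size=200, tie_weights=True)
log_probs = model(torch.randint(0, 10000, (1, 35)))  # (batch=1, seq_len=35, vocab)
```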


2 Related Work

Neural network language models (NNLMs) assign probabilities to word sequences. Their resurgence was initiated by (Bengio et al., 2003). Recurrent neural networks were first used for language modeling in (Mikolov et al., 2010) and (Pascanu et al., 2013). The first model that implemented language modeling with LSTMs (Hochreiter and Schmidhuber, 1997) was (Sundermeyer et al., 2012). Following that, (Zaremba et al., 2014) introduced a dropout (Srivastava, 2013) augmented NNLM. (Gal, 2015; Gal and Ghahramani, 2016) proposed a new dropout method, referred to as Bayesian Dropout below, that improves on the results of (Zaremba et al., 2014).

The skip-gram word2vec model introduced in (Mikolov et al., 2013a; Mikolov et al., 2013b) learns representations of words. This model learns a representation for each word in its vocabulary, both in an input embedding matrix and in an output embedding matrix. When training is complete, the vectors that are returned are the input embeddings. The output embedding is typically ignored, although (Mitra et al., 2016; Mnih and Kavukcuoglu, 2013) use both the output and input embeddings of words in order to compute word similarity. Recently, (Goldberg and Levy, 2014) argued that the output embedding of the word2vec skip-gram model needs to be different than the input embedding. As we show, tying the input and the output embeddings is indeed detrimental in word2vec. However, it improves performance in NNLMs.

In neural machine translation (NMT) models (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2014), the decoder, which generates the translation of the input sentence in the target language, is a language model that is conditioned on both the previous words of the output sentence and on the source sentence. State of the art results in NMT have recently been achieved by systems that segment the source and target words into subword units (Sennrich et al., 2016a). One such method (Sennrich et al., 2016b) is based on the byte pair encoding (BPE) compression algorithm (Gage, 1994). BPE segments rare words into their more commonly appearing subwords.

Weight tying was previously used in the log-bilinear model of (Mnih and Hinton, 2009), but the decision to use it was not explained, and its effect on the model's performance was not tested. Independently and concurrently with our work, (Inan et al., 2016) presented an explanation for weight tying in NNLMs based on (Hinton et al., 2015).

3 Weight Tying

In this work, we employ three different model categories: NNLMs, the word2vec skip-gram model, and NMT models. Weight tying is applied similarly in all models. For translation models, we also present a three-way weight tying method.

NNLM models contain an input embedding matrix, two LSTM layers (h1 and h2), a third hidden scores/logits layer h3, and a softmax layer. The loss used during training is the cross entropy loss without any regularization terms.

Following (Zaremba et al., 2014), we employ two models: large and small. The large model employs dropout for regularization. The small model is not regularized. Therefore, we propose the following regularization scheme. A projection matrix P ∈ IR^{H×H} is inserted before the output embedding, i.e., h3 = V P h2. The regularizing term λ‖P‖2 is then added to the small model's loss function. In all of our experiments, λ = 0.15. Projection regularization allows us to use the same embedding (as both the input and output embedding) with some adaptation that is under regularization. It is, therefore, especially suited for weight tying (WT).
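A minimal sketch of projection regularization, assuming PyTorch and illustrative names; reading ‖P‖2 as the Frobenius norm of P is our assumption here.

```python
# Minimal sketch (PyTorch; illustrative, not the authors' code) of projection
# regularization: a square matrix P is inserted before the (tied) output
# embedding, h3 = V P h2, and lambda * ||P|| is added to the training loss.
import torch
import torch.nn as nn

hidden_size, vocab_size, lam = 200, 10000, 0.15

embedding = nn.Embedding(vocab_size, hidden_size)             # tied matrix S (= U = V)
projection = nn.Linear(hidden_size, hidden_size, bias=False)  # projection P

def output_scores(h2):
    # h3 = V P h2, where V is the transpose-view of the tied embedding matrix.
    return projection(h2) @ embedding.weight.t()

def regularized_loss(h2, targets):
    scores = output_scores(h2)                                # (batch, vocab)
    nll = nn.functional.cross_entropy(scores, targets)
    # Norm of P as the regularizing term (Frobenius norm assumed here).
    return nll + lam * projection.weight.norm(p=2)

# Toy usage: a batch of 4 "topmost LSTM outputs" and their target words.
loss = regularized_loss(torch.randn(4, hidden_size), torch.randint(0, vocab_size, (4,)))
loss.backward()
```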
While training a vanilla untied NNLM, at timestep t, with current input word sequence i_{1:t} = [i_1, ..., i_t] and current target output word o_t, the negative log likelihood loss is given by:

$$L_t = -\log p_t(o_t \mid i_{1:t}), \qquad p_t(o_t \mid i_{1:t}) = \frac{\exp(V_{o_t}^\top h_2^{(t)})}{\sum_{x=1}^{C} \exp(V_x^\top h_2^{(t)})},$$

where U_k (V_k) is the k-th row of U (V), which corresponds to word k, and h_2^{(t)} is the vector of activations of the topmost LSTM layer's output at time t. For simplicity, we assume that at each timestep t, i_t ≠ o_t.

Optimization of the model is performed using stochastic gradient descent. The update for row k of the input embedding is:

$$\frac{\partial L_t}{\partial U_k} = \begin{cases} \left( \sum_{x=1}^{C} p_t(x \mid i_{1:t}) \cdot V_x^\top - V_{o_t}^\top \right) \frac{\partial h_2^{(t)}}{\partial U_{i_t}} & k = i_t \\ 0 & k \neq i_t \end{cases}$$

For the output embedding, row k's update is:

$$\frac{\partial L_t}{\partial V_k} = \begin{cases} \left( p_t(o_t \mid i_{1:t}) - 1 \right) h_2^{(t)} & k = o_t \\ p_t(k \mid i_{1:t}) \cdot h_2^{(t)} & k \neq o_t \end{cases}$$

Therefore, in the untied model, at every timestep, the only row that is updated in the input embedding is the row U_{i_t} representing the current input word.
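The NumPy sketch below illustrates these update rules on toy dimensions and checks the output-embedding gradient against a finite-difference estimate; it is an illustration of the formulas above, not code from the paper.

```python
# Illustrative NumPy sketch of the update rules: the gradient of
# L_t = -log p_t(o_t) with respect to each row of the output embedding V, and
# the factor (sum_x p_t(x) V_x - V_{o_t}) that drives the update of U_{i_t}.
import numpy as np

rng = np.random.default_rng(0)
C, H = 8, 4                  # toy vocabulary and hidden sizes
V = rng.normal(size=(C, H))  # output embedding
h2 = rng.normal(size=H)      # topmost LSTM activation at time t
o_t = 3                      # target output word

scores = V @ h2
p = np.exp(scores - scores.max())
p /= p.sum()                 # p_t(x | i_{1:t}) via softmax

# Output embedding: row o_t gets (p_t(o_t) - 1) * h2; every other row k gets p_t(k) * h2.
grad_V = np.outer(p, h2)
grad_V[o_t] -= h2

# Input embedding: only row i_t is updated; its gradient is
# (sum_x p_t(x) V_x - V_{o_t}) multiplied by the Jacobian dh2/dU_{i_t} (omitted here).
dL_dh2 = p @ V - V[o_t]

# Sanity check of grad_V against a finite-difference estimate of dL/dV[0, 0].
def loss(Vm):
    s = Vm @ h2
    return -s[o_t] + np.log(np.exp(s - s.max()).sum()) + s.max()

eps = 1e-6
V_pert = V.copy(); V_pert[0, 0] += eps
assert abs((loss(V_pert) - loss(V)) / eps - grad_V[0, 0]) < 1e-4
```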

This means that vectors representing rare words are updated only a small number of times. The output embedding updates every row at each timestep.

In tied NNLMs, we set U = V = S. The update for each row in S is the sum of the updates obtained for the two roles of S as both an input and an output embedding.

The update for row k ≠ i_t is similar to the update of row k in the untied NNLM's output embedding (the only difference being that U and V are both replaced by a single matrix S). In this case, there is no update from the input embedding role of S.

The update for row k = i_t is made up of a term from the input embedding (case k = i_t) and a term from the output embedding (case k ≠ o_t). The second term grows linearly with p_t(i_t | i_{1:t}), which is expected to be close to zero, since words seldom appear twice in a row (the low probability in the network was also verified experimentally). The update that occurs in this case is, therefore, mostly impacted by the update from the input embedding role of S.

To conclude, in the tied NNLM, every row of S is updated during each iteration, and for all rows except one, this update is similar to the update of the output embedding of the untied model. This implies a greater degree of similarity of the tied embedding to the untied model's output embedding than to its input embedding.

The analysis above focuses on NNLMs for brevity. In word2vec, the update rules are similar, except that h_2^{(t)} is replaced by the identity function. As argued by (Goldberg and Levy, 2014), in this case weight tying is not appropriate, because if p_t(i_t | i_{1:t}) is close to zero then so is the norm of the embedding of i_t. This argument does not hold for NNLMs, since the LSTM layers cause a decoupling of the input and output embeddings.

Finally, we evaluate the effect of weight tying in neural translation models. In this model:

$$p_t(o_t \mid i_{1:t}, r) = \frac{\exp(V_{o_t}^\top G^{(t)})}{\sum_{x=1}^{C_t} \exp(V_x^\top G^{(t)})},$$

where r = (r_1, ..., r_N) is the set of words in the source sentence, U and V are the input and output embeddings of the decoder, and W is the input embedding of the encoder (in translation models U, V ∈ IR^{C_t×H} and W ∈ IR^{C_s×H}, where C_s / C_t is the size of the vocabulary of the source / target). G^{(t)} is the decoder, which receives the context vector, the embedding of the input word (i_t) in U, and its previous state at each timestep. c_t is the context vector at timestep t, c_t = Σ_{j∈r} a_{tj} h_j, where a_{tj} is the weight given to the j-th annotation at time t: a_{tj} = exp(e_{tj}) / Σ_{k∈r} exp(e_{tk}), and e_{tj} = a_t(h_j), where a is the alignment model. F is the encoder, which produces the sequence of annotations (h_1, ..., h_N).

The output of the decoder is then projected to a vector of scores using the output embedding: l_t = V G^{(t)}. The scores are then converted to probability values using the softmax function.

In our weight tied translation model, we tie the input and output embeddings of the decoder.

We observed that when preprocessing the ACL WMT 2014 EN→FR (http://statmt.org/wmt14/translation-task.html) and WMT 2015 EN→DE (http://statmt.org/wmt15/translation-task.html) datasets using BPE, many of the subwords appeared in the vocabulary of both the source and the target languages. Tab. 1 shows that up to 90% (85%) of BPE subwords between English and French (German) are shared.

Language pair   Subwords only in source   Subwords only in target   Subwords in both
EN→FR           2K                        7K                        85K
EN→DE           3K                        11K                       80K

Table 1: Shared BPE subwords between pairs of languages.

Based on this observation, we propose three-way weight tying (TWWT), where the input embedding of the decoder, the output embedding of the decoder and the input embedding of the encoder are all tied. The single source/target vocabulary of this model is the union of both the source and target vocabularies. In this model, both in the encoder and decoder, all subwords are embedded in the same duo-lingual space.
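The following PyTorch sketch shows the weight-sharing pattern of three-way weight tying in a bare-bones encoder–decoder; attention and all other details of the actual NMT models are omitted, and module names and sizes are illustrative assumptions.

```python
# Minimal sketch (PyTorch; illustrative, not the DL4MT code) of three-way
# weight tying: the encoder input embedding, the decoder input embedding and
# the decoder output embedding are one matrix over the joint (source+target)
# BPE vocabulary. Attention is omitted for brevity.
import torch
import torch.nn as nn

class TiedSeq2Seq(nn.Module):
    def __init__(self, joint_vocab_size, embed_size, hidden_size):
        super().__init__()
        # One duo-lingual embedding shared by source and target subwords.
        self.shared_embedding = nn.Embedding(joint_vocab_size, embed_size)
        self.encoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(embed_size, hidden_size, batch_first=True)
        # Map the decoder state back to embedding size before scoring,
        # so the output projection can reuse the shared embedding matrix.
        self.to_embed = nn.Linear(hidden_size, embed_size)

    def forward(self, src_ids, tgt_ids):
        _, enc_state = self.encoder(self.shared_embedding(src_ids))
        dec_out, _ = self.decoder(self.shared_embedding(tgt_ids), enc_state)
        # Output scores: dot product with the same shared embedding rows.
        return self.to_embed(dec_out) @ self.shared_embedding.weight.t()

# Toy usage with a joint vocabulary of 90K subwords, as in a BPE setting.
model = TiedSeq2Seq(joint_vocab_size=90000, embed_size=500, hidden_size=1000)
scores = model(torch.randint(0, 90000, (2, 7)), torch.randint(0, 90000, (2, 5)))
```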
4 Results

Our experiments study the quality of various embeddings, the similarity between them, and the impact of tying them on the word2vec skip-gram model, NNLMs, and NMT models.

4.1 Quality of Obtained Embeddings

In order to compare the various embeddings, we pooled five embedding evaluation methods from the literature. These evaluation methods involve calculating pairwise (cosine) distances between embeddings and correlating these distances with human judgments of the strength of relationships between concepts. We use: Simlex999 (Hill et al., 2016), Verb-143 (Baker et al., 2014), MEN (Bruni et al., 2014), Rare-Word (Luong et al., 2013) and MTurk-771 (Halawi et al., 2012).
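A minimal NumPy/SciPy sketch of this evaluation protocol, with illustrative function names and a made-up toy benchmark:

```python
# Illustrative sketch (not the authors' evaluation code): score each benchmark
# word pair by the cosine similarity of its two embedding rows, then report
# Spearman's rank correlation against the human similarity judgments.
import numpy as np
from scipy.stats import spearmanr

def evaluate_embedding(embedding, word_to_row, scored_pairs):
    """embedding: (vocab, dim) matrix; scored_pairs: [(word1, word2, human_score), ...]."""
    model_scores, human_scores = [], []
    for w1, w2, human in scored_pairs:
        if w1 not in word_to_row or w2 not in word_to_row:
            continue  # benchmarks such as Rare-Word contain out-of-vocabulary words
        u, v = embedding[word_to_row[w1]], embedding[word_to_row[w2]]
        model_scores.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        human_scores.append(human)
    return spearmanr(model_scores, human_scores).correlation

# Toy usage with a random embedding and a made-up three-pair benchmark.
rng = np.random.default_rng(1)
emb = rng.normal(size=(5, 8))
vocab = {"cat": 0, "dog": 1, "car": 2, "truck": 3, "tree": 4}
pairs = [("cat", "dog", 8.5), ("car", "truck", 9.0), ("cat", "tree", 2.1)]
print(evaluate_embedding(emb, vocab, pairs))
```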

             Input   Output   Tied
Simlex999    0.30    0.29     0.17
Verb-143     0.41    0.34     0.12
MEN          0.66    0.61     0.50
Rare-Word    0.34    0.34     0.23
MTurk-771    0.59    0.54     0.37

Table 2: Comparison of input and output embeddings learned by a word2vec skip-gram model. Results are also shown for the tied word2vec model. Spearman's correlation ρ is reported for five word embedding evaluation benchmarks.

             PTB                     text8
             In      Out     Tied    In      Out     Tied
Simlex999    0.02    0.13    0.14    0.17    0.27    0.28
Verb-143     0.12    0.37    0.32    0.20    0.35    0.42
MEN          0.11    0.21    0.26    0.26    0.50    0.50
Rare-Word    0.28    0.38    0.36    0.14    0.15    0.17
MTurk-771    0.17    0.28    0.30    0.26    0.48    0.45

Table 3: Comparison of the input/output embeddings of the small model from (Zaremba et al., 2014) and the embeddings from our weight tied variant. Spearman's correlation ρ is presented.

A      B      word2vec   NNLM(S)   NNLM(L)
In     Out    0.77       0.13      0.16
In     Tied   0.19       0.31      0.45
Out    Tied   0.39       0.65      0.77

Table 4: Spearman's rank correlation ρ of similarity values between all pairs of words evaluated for the different embeddings: input/output embeddings (of the untied model) and the embeddings of our tied model. We show the results for both the word2vec models and the small and large NNLM models from (Zaremba et al., 2014).

Model                           Size   Train   Val.   Test
Large (Zaremba et al., 2014)    66M    37.8    82.2   78.4
Large + Weight Tying            51M    48.5    77.7   74.3
Large + BD (Gal, 2015) + WD     66M    24.3    78.1   75.2
Large + BD + WT                 51M    28.2    75.8   73.2
RHN (Zilly et al., 2016) + BD   32M    67.4    71.2   68.5
RHN + BD + WT                   24M    74.1    68.1   66.0

Table 5: Word level perplexity (lower is better) on PTB and size (number of parameters) of models that use either dropout (baseline model) or Bayesian dropout (BD). WD – weight decay.

We begin by training both the tied and untied word2vec models on the text8 dataset (http://mattmahoney.net/dc/textdata), using a vocabulary consisting only of words that appear at least five times. As can be seen in Tab. 2, the output embedding is almost as good as the input embedding. As expected, the embedding of the tied model is not competitive. The situation is different when training the small NNLM model on either the Penn Treebank (Marcus et al., 1993) or text8 datasets (for PTB, we used the same train/validation/test set split and vocabulary as (Mikolov et al., 2011), while on text8 we used the split/vocabulary from (Mikolov et al., 2014)). These results are presented in Tab. 3. In this case, the input embedding is far inferior to the output embedding. The tied embedding is comparable to the output embedding.

A natural question given these results and the analysis in Sec. 3 is whether the word embedding in the weight tied NNLM model is more similar to the input embedding or to the output embedding of the original model. We therefore run the following experiment: first, for each embedding, we compute the cosine distances between each pair of words. We then compute Spearman's rank correlation between these vectors of distances. As can be seen in Tab. 4, the results are consistent with our analysis and the results of Tab. 2 and Tab. 3: for word2vec, the input and output embeddings are similar to each other and differ from the tied embedding; for the NNLM models, the output embedding and the tied embedding are similar, while the input embedding is somewhat similar to the tied embedding and differs considerably from the output embedding.

4.2 Neural Network Language Models

We next study the effect of tying the embeddings on the perplexity obtained by the NNLM models. Following (Zaremba et al., 2014), we study two NNLMs. The two models differ mostly in the size of the LSTM layers. In the small model, both LSTM layers contain 200 units, and in the large model, both contain 1500 units. In addition, the large model uses three dropout layers, one placed right before the first LSTM layer, one between h1 and h2, and one right after h2. The dropout probability is 0.65. For both the small and large models, we use the same hyperparameters (i.e., weight initialization, learning rate schedule, batch size) as in (Zaremba et al., 2014).

In addition to training our models on PTB and text8, following (Miyamoto and Cho, 2016), we also compare the performance of the NNLMs on the BBC (Greene and Cunningham, 2006) and IMDB (Maas et al., 2011) datasets, each of which we process and split into a train/validation/test split (we use the same vocabularies as (Miyamoto and Cho, 2016)).
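The dropout placement and weight tying described above can be sketched as follows (PyTorch, with illustrative hyperparameters; not the reference implementation):

```python
# Sketch (PyTorch; illustrative) of the large model's dropout placement:
# dropout before the first LSTM layer, between h1 and h2, and after h2,
# with the input and output embeddings tied.
import torch
import torch.nn as nn

class LargeTiedNNLM(nn.Module):
    def __init__(self, vocab_size=10000, hidden_size=1500, p_drop=0.65):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm1 = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(p_drop)
        self.output = nn.Linear(hidden_size, vocab_size, bias=False)
        self.output.weight = self.embedding.weight    # weight tying

    def forward(self, word_ids):
        x = self.dropout(self.embedding(word_ids))    # dropout before the first LSTM
        h1, _ = self.lstm1(x)
        h2, _ = self.lstm2(self.dropout(h1))          # dropout between h1 and h2
        return self.output(self.dropout(h2))          # dropout after h2, then scores

model = LargeTiedNNLM()
scores = model(torch.randint(0, 10000, (20, 35)))    # (batch, seq_len, vocab)
```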

In the first experiment, which was conducted on the PTB dataset, we compare the perplexity obtained by the large NNLM model and our version in which the input and output embeddings are tied. As can be seen in Tab. 5, weight tying significantly reduces perplexity on both the validation set and the test set, but not on the training set. This indicates less overfitting, as expected due to the reduction in the number of parameters. Recently, (Gal and Ghahramani, 2016) proposed a modified model that uses Bayesian dropout and weight decay. They obtained improved performance. When the embeddings of this model are tied, a similar amount of improvement is gained. We tried this with and without weight decay and got similar results in both cases, with a slight improvement in the latter model. Finally, by replacing the LSTM with a recurrent highway network (Zilly et al., 2016), state of the art results are achieved when applying weight tying. The contribution of WT is also significant in this model.

Perplexity results are often reported separately for models with and without dropout. In Tab. 6, we report the results of the small NNLM model, which does not utilize dropout, on PTB. As can be seen, both WT and projection regularization (PR) improve the results. When combining both methods together, state of the art results are obtained. An analogous table for text8, IMDB and BBC is Tab. 7, which shows a significant reduction in perplexity across these datasets when both PR and WT are used. PR does not help the large models, which employ dropout for regularization.

Model             Size     Train   Val.    Test
KN 5-gram         –        –       –       141
RNN               –        –       –       123
LSTM              –        –       –       117
Stack RNN         8.48M    –       –       110
FOFE-FNN          –        –       –       108
Noisy LSTM        4.65M    –       111.7   108.0
Deep RNN          6.16M    –       –       107.5
Small model       4.65M    38.0    120.7   114.5
Small + WT        2.65M    36.4    117.5   112.4
Small + PR        4.69M    50.8    116.0   111.7
Small + WT + PR   2.69M    53.5    104.9   100.9

Table 6: Word level perplexity on PTB and size for models that do not use dropout. The compared models are: KN 5-gram (Mikolov et al., 2011), RNN (Mikolov et al., 2011), LSTM (Graves, 2013), Stack / Deep RNN (Pascanu et al., 2013), FOFE-FNN (Zhang et al., 2015), Noisy LSTM (Gülçehre et al., 2016), and the small model from (Zaremba et al., 2014). The last three models are our models, which extend the small model. PR – projection regularization.

                Small   S + WT   S + PR   S + WT + PR
text8   Train   90.4    95.6     92.6     95.3
        Val.    –       –        –        –
        Test    195.3   187.1    199.0    183.2
IMDB    Train   71.3    75.4     72.0     72.9
        Val.    94.1    94.6     94.0     91.2
        Test    94.3    94.8     94.4     91.5
BBC     Train   28.6    30.1     42.5     45.7
        Val.    103.6   99.4     104.9    96.4
        Test    110.8   106.8    108.7    98.9

Table 7: Word level perplexity on the text8, IMDB and BBC datasets. The last three models are our models, which extend the small model (S) of (Zaremba et al., 2014).

4.3 Neural Machine Translation

Finally, we study the impact of weight tying in attention based NMT models, using the DL4MT implementation (https://github.com/nyu-dl/dl4mt-tutorial). We train our EN→FR models on the parallel corpora provided by ACL WMT 2014. We use the data as processed by (Cho et al., 2014) using the data selection method of (Axelrod et al., 2011). For EN→DE we train on data from the translation task of WMT 2015, validate on newstest2013 and test on newstest2014 and newstest2015. Following (Sennrich et al., 2016b), we learn the BPE segmentation on the union of the vocabularies that we are translating from and to (we use BPE with 89500 merge operations). All models were trained using Adadelta (Zeiler, 2012) for 300K updates, have a hidden layer size of 1000, and all embedding layers are of size 500.

                     Size   Validation   Test
EN→FR   Baseline     168M   29.49        33.13
        Decoder WT   122M   29.47        33.26
        TWWT         80M    29.43        33.46
EN→DE   Baseline     165M   20.96        16.79
        Decoder WT   119M   21.09        16.54
        TWWT         79M    21.02        17.15

Table 8: Size (number of parameters) and BLEU score of various translation models. TWWT – three-way weight tying.

Tab. 8 shows that even though the weight tied models have about 28% fewer parameters than the baseline models, their performance is similar. This is also the case for the three-way weight tied models, even though they have about 52% fewer parameters than their untied counterparts.
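These reductions come almost entirely from dropped duplicate embedding matrices. The back-of-the-envelope calculation below uses assumed vocabulary sizes and an assumed number of non-embedding parameters (chosen only to be of the right order of magnitude, not taken from the models above) to show how savings of roughly this size arise.

```python
# Back-of-the-envelope calculation (all vocabulary sizes and the non-embedding
# parameter count are assumptions, not figures from the paper) of why decoder
# weight tying and three-way weight tying shrink an NMT model: each dropped
# embedding matrix has vocab * embed_size parameters.
embed_size = 500
source_vocab, target_vocab, joint_vocab = 92_000, 92_000, 100_000  # assumptions
other_params = 30_000_000  # everything that is not an embedding (assumption)

baseline = other_params + (source_vocab + 2 * target_vocab) * embed_size  # U_enc, U_dec, V_dec
decoder_wt = other_params + (source_vocab + target_vocab) * embed_size    # U_dec = V_dec
twwt = other_params + joint_vocab * embed_size                            # one shared matrix

for name, size in [("baseline", baseline), ("decoder WT", decoder_wt), ("TWWT", twwt)]:
    print(f"{name}: {size / 1e6:.0f}M parameters ({1 - size / baseline:.0%} smaller)")
```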

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Simon Baker, Roi Reichart, and Anna Korhonen. 2014. An unsupervised model for instance level subcategorization acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 278–289. Association for Computational Linguistics.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res. (JAIR), 49(1-47).

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.

Philip Gage. 1994. A new algorithm for data compression. The C Users Journal, 12(2):23–38.

Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).

Yarin Gal. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv preprint arXiv:1512.05287.

Yoav Goldberg and Omer Levy. 2014. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Derek Greene and Pádraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proc. 23rd International Conference on Machine Learning (ICML'06), pages 377–384. ACM Press.

Çağlar Gülçehre, Marcin Moczulski, Misha Denil, and Yoshua Bengio. 2016. Noisy activation functions. arXiv preprint arXiv:1603.00391.

Guy Halawi, Gideon Dror, Evgeniy Gabrilovich, and Yehuda Koren. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1406–1414.

Felix Hill, Roi Reichart, and Anna Korhonen. 2016. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, Washington, USA, October. Association for Computational Linguistics.

Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113. Association for Computational Linguistics.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150. Association for Computational Linguistics.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist., 19(2):313–330, June.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045–1048.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5528–5531. IEEE.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michaël Mathieu, and Marc'Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. 2016. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137.

Yasumasa Miyamoto and Kyunghyun Cho. 2016. Gated word-character recurrent language model. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1992–1997. Association for Computational Linguistics.

Andriy Mnih and Geoffrey E Hinton. 2009. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pages 1081–1088.

Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.

Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.

Razvan Pascanu, Çağlar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL.

Nitish Srivastava. 2013. Improving Neural Networks with Dropout. Master's thesis, University of Toronto, Toronto, Canada, January.

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Interspeech, pages 194–197, Portland, OR, USA, September.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

Matthew D Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Shiliang Zhang, Hui Jiang, Mingbin Xu, Junfeng Hou, and Li-Rong Dai. 2015. A fixed-size encoding method for variable-length sequences with its application to neural network language models. arXiv preprint arXiv:1505.01504.

Julian G. Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.
