Character N-Gram Embeddings to Improve RNN Language Models
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Sho Takase,†∗ Jun Suzuki,†‡ Masaaki Nagata†
†NTT Communication Science Laboratories  ‡Tohoku University
[email protected], [email protected], [email protected]
∗Current affiliation: Tokyo Institute of Technology.

Abstract

This paper proposes a novel Recurrent Neural Network (RNN) language model that takes advantage of character information. We focus on character n-grams based on research in the field of word embedding construction (Wieting et al. 2016). Our proposed method constructs word embeddings from character n-gram embeddings and combines them with ordinary word embeddings. We demonstrate that the proposed method achieves the best perplexities on the language modeling datasets: Penn Treebank, WikiText-2, and WikiText-103. Moreover, we conduct experiments on application tasks: machine translation and headline generation. The experimental results indicate that our proposed method also positively affects these tasks.

1 Introduction

Neural language models have played a crucial role in recent advances of neural network based methods in natural language processing (NLP). For example, neural encoder-decoder models, which are becoming the de facto standard for various natural language generation tasks including machine translation (Sutskever, Vinyals, and Le 2014), summarization (Rush, Chopra, and Weston 2015), dialogue (Wen et al. 2015), and caption generation (Vinyals et al. 2015), can be interpreted as conditional neural language models. Moreover, neural language models can be used for rescoring outputs from traditional methods, and they significantly improve the performance of automatic speech recognition (Du et al. 2016). This implies that better neural language models improve the performance of application tasks.

In general, neural language models require word embeddings as an input (Zaremba, Sutskever, and Vinyals 2014). However, as described by (Verwimp et al. 2017), this approach cannot make use of the internal structure of words, although the internal structure is often an effective clue for considering the meaning of a word. For example, we can comprehend that the word 'causal' is related to 'cause' immediately because both words include the same character sequence 'caus'. Thus, if we incorporate a method that handles the internal structure, such as character information, we can improve the quality of neural language models and probably make them robust to infrequent words.

To incorporate the internal structure, (Verwimp et al. 2017) concatenated character embeddings with an input word embedding. They demonstrated that incorporating character embeddings improved the performance of RNN language models. Moreover, (Kim et al. 2016) and (Jozefowicz et al. 2016) applied Convolutional Neural Networks (CNN) to construct word embeddings from character embeddings.

On the other hand, in the field of word embedding construction, some previous researchers found that character n-grams are more useful than single characters (Wieting et al. 2016; Bojanowski et al. 2017). In particular, (Wieting et al. 2016) demonstrated that constructing word embeddings from character n-gram embeddings outperformed the methods that construct word embeddings from character embeddings by using a CNN or a Long Short-Term Memory (LSTM).

Based on their reports, in this paper, we propose a neural language model that utilizes character n-gram embeddings. Our proposed method encodes character n-gram embeddings into a word embedding with simplified Multi-dimensional Self-attention (MS) (Shen et al. 2018). We refer to this constructed embedding as charn-MS-vec. The proposed method regards charn-MS-vec as an input in addition to a word embedding.

We conduct experiments on the well-known benchmark datasets: Penn Treebank, WikiText-2, and WikiText-103. Our experiments indicate that the proposed method outperforms neural language models trained with well-tuned hyperparameters and achieves state-of-the-art scores on each dataset. In addition, we incorporate our proposed method into a standard neural encoder-decoder model and investigate its effect on machine translation and headline generation. We indicate that the proposed method also has a positive effect on such tasks.
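As a small, purely illustrative sketch of the internal-structure intuition above (not taken from the paper): extracting character n-grams with boundary markers makes the shared substring 'caus' in 'causal' and 'cause' explicit, which a whole-word embedding lookup cannot see. The choice of n = 4 here is arbitrary.

```python
# Illustrative only: 'causal' and 'cause' share character 4-grams, exposing the
# internal structure that a word-level embedding lookup ignores.
def char_ngrams(word: str, n: int = 4) -> set:
    marked = "^" + word + "$"  # '^' and '$' mark the beginning and end of the word
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

print(char_ngrams("causal") & char_ngrams("cause"))  # {'^cau', 'caus'}
```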
2 RNN Language Model

In this study, we focus on RNN language models, which are widely used in the literature. This section briefly overviews the basic RNN language model.

In language modeling, we compute the joint probability by using the product of conditional probabilities. Let w_{1:T} be a word sequence with length T, namely, w_1, ..., w_T. We formally obtain the joint probability of the word sequence w_{1:T} as follows:

p(w_{1:T}) = p(w_1) \prod_{t=1}^{T-1} p(w_{t+1} \mid w_{1:t}).    (1)

p(w_1) is generally assumed to be 1 in this literature, i.e., p(w_1) = 1, and thus we can ignore its calculation.¹

¹ This definition is based on published implementations of previous studies such as https://github.com/wojzaremba/lstm and https://github.com/salesforce/awd-lstm-lm.

To estimate the conditional probability p(w_{t+1} | w_{1:t}), RNN language models encode the sequence w_{1:t} into a fixed-length vector and compute the probability distribution of each word from this fixed-length vector. Let V be the vocabulary size and let P_t \in \mathbb{R}^V be the probability distribution over the vocabulary at timestep t. Moreover, let D_h be the dimension of the hidden state of an RNN and let D_e be the dimension of the embedding vectors. Then, RNN language models predict the probability distribution P_{t+1} by the following equations:

P_{t+1} = \mathrm{softmax}(W h_t + b),    (2)
h_t = f(e_t, h_{t-1}),    (3)
e_t = E x_t,    (4)

where W \in \mathbb{R}^{V \times D_h} is a weight matrix, b \in \mathbb{R}^V is a bias term, and E \in \mathbb{R}^{D_e \times V} is a word embedding matrix. x_t \in \{0, 1\}^V and h_t \in \mathbb{R}^{D_h} are the one-hot vector of the input word w_t and the hidden state of the RNN at timestep t, respectively. We define h_t at timestep t = 0 as a zero vector, that is, h_0 = 0. Let f(·) represent an abstract function of an RNN, which might be the LSTM, the Quasi-Recurrent Neural Network (QRNN) (Bradbury et al. 2017), or any other RNN variant.
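The equations above map directly onto a few lines of code. The following is a minimal sketch (not the authors' implementation) of an RNN language model in PyTorch, with an LSTM standing in for the abstract function f; the class name, variable names, and the hyperparameter values in the usage example are our own illustrative choices.

```python
# Minimal sketch of the RNN language model in Equations 1-4 (PyTorch, LSTM as f).
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, V, D_e, D_h):
        super().__init__()
        self.E = nn.Embedding(V, D_e)                    # word embedding matrix E (Eq. 4)
        self.rnn = nn.LSTM(D_e, D_h, batch_first=True)   # abstract RNN function f (Eq. 3)
        self.out = nn.Linear(D_h, V)                     # weight matrix W and bias b (Eq. 2)

    def forward(self, word_ids, state=None):
        # word_ids: LongTensor of shape (batch, T) holding indices of w_1, ..., w_T
        e = self.E(word_ids)                             # e_t = E x_t            (Eq. 4)
        h, state = self.rnn(e, state)                    # h_t = f(e_t, h_{t-1})  (Eq. 3)
        log_P = torch.log_softmax(self.out(h), dim=-1)   # P_{t+1} = softmax(W h_t + b) (Eq. 2)
        return log_P, state

# Joint probability of a sequence via Eq. 1, with p(w_1) = 1 (arbitrary sizes):
model = RNNLanguageModel(V=10000, D_e=400, D_h=1150)
w = torch.randint(0, 10000, (1, 6))                        # dummy word ids w_1..w_6
log_P, _ = model(w[:, :-1])                                # predictions for w_2..w_6
log_p_seq = log_P.gather(2, w[:, 1:].unsqueeze(-1)).sum()  # sum_t log p(w_{t+1} | w_{1:t})
```

Passing state=None lets PyTorch initialize the LSTM state to zeros, which matches the definition h_0 = 0 above.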
3 Incorporating Character n-gram Embeddings

We incorporate charn-MS-vec, which is an embedding constructed from character n-gram embeddings, into RNN language models since, as discussed earlier, previous studies revealed that we can construct better word embeddings by using character n-gram embeddings (Wieting et al. 2016; Bojanowski et al. 2017). In particular, we expect charn-MS-vec to help represent infrequent words by taking advantage of the internal structure.

Figure 1: Overview of the proposed method. The proposed method computes charn-MS-vec from character n-gram (3-gram in this figure) embeddings and inputs the sum of it and the standard word embedding into an RNN.

Figure 1 shows an overview of the proposed method using character 3-gram embeddings (char3-MS-vec). As illustrated in this figure, our proposed method regards the sum of char3-MS-vec and the standard word embedding as the input of an RNN. In other words, let c_t be charn-MS-vec; we replace Equation 4 with the following:

e_t = E x_t + c_t.    (5)

3.1 Multi-dimensional Self-attention

To compute c_t, we apply an encoder to character n-gram embeddings. Previous studies demonstrated that additive composition, which computes the (weighted) sum of embeddings, is a suitable method for embedding construction (Takase, Okazaki, and Inui 2016; Wieting et al. 2016). Thus, we adopt (simplified) multi-dimensional self-attention (Shen et al. 2018), which computes weights for each dimension of the given embeddings and sums up the weighted embeddings (i.e., an element-wise weighted sum), as an encoder. Let s_i be the embedding of the i-th character n-gram of an input word, let I be the number of character n-grams extracted from the word, and let S be the matrix whose i-th column corresponds to s_i, that is, S = [s_1, ..., s_I]. The multi-dimensional self-attention constructs the word embedding c_t by the following equations:

c_t = \sum_{i=1}^{I} g_i \odot s_i,    (6)
\{g_i\}_j = \{\mathrm{softmax}([(W_c S)^{\top}]_j)\}_i,    (7)

where \odot means the element-wise product of vectors, W_c \in \mathbb{R}^{D_e \times D_e} is a weight matrix, [\cdot]_j is the j-th column of a given matrix, and \{\cdot\}_j is the j-th element of a given vector. In short, Equation 7 applies the softmax function to each row of W_c S and extracts the i-th column as g_i.

Let us consider the case where the input word is 'the' and we use character 3-grams, as in Figure 1. We prepare the special characters '^' and '$' to represent the beginning and end of a word, respectively. Then, 'the' is composed of three character 3-grams: '^th', 'the', and 'he$'. We multiply the embeddings of these 3-grams by the transformation matrix W_c and apply the softmax function to each row² as in Equation 7. As a result

² (Shen et al. 2018) applied an activation function and one more transformation matrix to embeddings, but we omit these operations.
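To make the construction concrete, here is a minimal sketch (under our own assumptions, not the authors' released code) of charn-MS-vec in PyTorch: character 3-grams are extracted with the '^'/'$' markers, embedded, weighted per dimension by the simplified multi-dimensional self-attention of Equations 6-7, summed into c_t, and then added to the word embedding as in Equation 5. The n-gram vocabulary ngram2id and all sizes are hypothetical.

```python
# Sketch of charn-MS-vec (Eqs. 5-7) using character 3-grams; names and sizes are illustrative.
import torch
import torch.nn as nn

def char_ngrams(word, n=3):
    """'the' -> ['^th', 'the', 'he$'] ('^' = beginning of word, '$' = end of word)."""
    marked = "^" + word + "$"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

class CharNGramMSVec(nn.Module):
    def __init__(self, num_ngrams, D_e):
        super().__init__()
        self.ngram_emb = nn.Embedding(num_ngrams, D_e)   # character n-gram embeddings s_i
        self.W_c = nn.Linear(D_e, D_e, bias=False)       # weight matrix W_c in Eq. 7

    def forward(self, ngram_ids):
        # ngram_ids: LongTensor of shape (I,), the I character n-grams of one input word
        S = self.ngram_emb(ngram_ids)                    # (I, D_e); row i is s_i
        scores = self.W_c(S)                             # (I, D_e), i.e. (W_c S)^T
        G = torch.softmax(scores, dim=0)                 # softmax over the I n-grams per
                                                         # dimension, i.e. each row of W_c S
        return (G * S).sum(dim=0)                        # c_t = sum_i g_i ⊙ s_i (Eq. 6)

# Hypothetical usage for the word 'the' from the example above:
ngram2id = {"^th": 0, "the": 1, "he$": 2}
encoder = CharNGramMSVec(num_ngrams=len(ngram2id), D_e=400)
ids = torch.tensor([ngram2id[g] for g in char_ngrams("the")])
c_t = encoder(ids)                                       # char3-MS-vec for 'the'
# The language model then feeds e_t = E x_t + c_t (Eq. 5) into the RNN.
```

Following footnote 2, this sketch omits the extra activation and additional transformation used in the original multi-dimensional self-attention of (Shen et al. 2018), matching the simplified form adopted in the paper.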