Character-Word LSTM Language Models
Lyan Verwimp  Joris Pelemans  Hugo Van hamme  Patrick Wambacq
ESAT – PSI, KU Leuven
Kasteelpark Arenberg 10, 3001 Heverlee, Belgium
[email protected]

Abstract

We present a Character-Word Long Short-Term Memory Language Model which both reduces the perplexity with respect to a baseline word-level language model and reduces the number of parameters of the model. Character information can reveal structural (dis)similarities between words and can even be used when a word is out-of-vocabulary, thus improving the modeling of infrequent and unknown words. By concatenating word and character embeddings, we achieve up to 2.77% relative improvement on English compared to a baseline model with a similar amount of parameters and 4.57% on Dutch. Moreover, we also outperform baseline word-level models with a larger number of parameters.

1 Introduction

Language models (LMs) play a crucial role in many speech and language processing tasks, among others speech recognition, machine translation and optical character recognition. The current state of the art are recurrent neural network (RNN) based LMs (Mikolov et al., 2010), and more specifically long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) LMs (Sundermeyer et al., 2012) and their variants (e.g. gated recurrent units (GRU) (Cho et al., 2014)). LSTMs and GRUs are usually very similar in performance, with GRU models often even outperforming LSTM models despite the fact that they have fewer parameters to train. However, Jozefowicz et al. (2015) recently showed that for the task of language modeling LSTMs work better than GRUs, therefore we focus on LSTM-based LMs.

In this work, we address some of the drawbacks of NN based LMs (and many other types of LMs). A first drawback is the fact that the parameters for infrequent words are typically less accurate because the network requires a lot of training examples to optimize the parameters. The second and most important drawback addressed is the fact that the model does not make use of the internal structure of the words, given that they are encoded as one-hot vectors. For example, ‘felicity’ (great happiness) is a relatively infrequent word (its frequency is much lower than the frequency of ‘happiness’ according to Google Ngram Viewer (Michel et al., 2011)) and will probably be an out-of-vocabulary (OOV) word in many applications, but since there are many nouns also ending in ‘ity’ (ability, complexity, creativity ...), knowledge of the surface form of the word will help in determining that ‘felicity’ is a noun. Hence, subword information can play an important role in improving the representations for infrequent words and even OOV words.

In our character-word (CW) LSTM LM, we concatenate character and word embeddings and feed the resulting character-word embedding to the LSTM. Hence, we provide the LSTM with information about the structure of the word. By concatenating the embeddings, the individual characters (as opposed to e.g. a bag-of-characters approach) are preserved and the order of the characters is implicitly modeled. Moreover, since we keep the total embedding size constant, the ‘word’ embedding shrinks in size and is partly replaced by character embeddings (with a much smaller vocabulary and hence a much smaller embedding matrix), which decreases the number of parameters of the model.

We investigate the influence of the number of characters added, the size of the character embeddings, weight sharing for the characters and the size of the (hidden layer of the) model. Given that common or similar character sequences do not always occur at the beginning of words (e.g. ‘overfitting’ – ‘underfitting’), we also examine adding the characters in forward order, backward order or both orders.
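The following PyTorch sketch illustrates this concatenated input. It is a minimal illustration under assumed sizes, not the authors' implementation (which is detailed in section 3): the word embedding is shrunk so that, together with n character embeddings, the total input size stays constant, and the result is fed to an ordinary LSTM LM.

```python
import torch
import torch.nn as nn

class CharWordInput(nn.Module):
    """Concatenate a (shrunk) word embedding with n character embeddings.

    All sizes below are hypothetical placeholders; the only constraint taken
    from the description above is that the total embedding size stays fixed.
    """
    def __init__(self, word_vocab=10000, char_vocab=48,
                 total_emb=200, n_chars=3, char_emb=15):
        super().__init__()
        word_emb = total_emb - n_chars * char_emb      # e.g. 200 - 3*15 = 155
        self.word_embedding = nn.Embedding(word_vocab, word_emb)
        # One shared character embedding matrix (weight sharing); the paper
        # also investigates the alternative of not sharing these weights.
        self.char_embedding = nn.Embedding(char_vocab, char_emb)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, n_chars)
        w = self.word_embedding(word_ids)              # (batch, seq, word_emb)
        c = self.char_embedding(char_ids)              # (batch, seq, n_chars, char_emb)
        c = c.flatten(start_dim=2)                     # (batch, seq, n_chars*char_emb)
        return torch.cat([w, c], dim=-1)               # (batch, seq, total_emb)

# The concatenated embedding is then the input of a standard LSTM LM, e.g.:
# lstm = nn.LSTM(input_size=200, hidden_size=200, num_layers=2, batch_first=True)
```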
We test our CW LMs on both English and Dutch. Since Dutch has a richer morphology than English due to among others its productive compounding (see e.g. (Réveil, 2012)), we expect that it should benefit more from a LM augmented with formal/morphological information.

The contributions of this paper are the following:

1. We present a method to combine word and subword information in an LSTM LM: concatenating word and character embeddings. As far as we know, this method has not been investigated before.

2. By decreasing the size of the word-level embedding (and hence the huge word embedding matrix), we effectively reduce the number of parameters in the model (see section 3.3; a rough illustration follows this list).

3. We find that the CW model both outperforms word-level LMs with the same number of hidden units (and hence a larger number of parameters) and word-level LMs with the same number of parameters. These findings are confirmed for English and Dutch, for a small model size and a large model size. The size of the character embeddings should be proportional to the total size of the embedding (the concatenation of characters should not exceed the size of the word-level embedding), and using characters in the backward order improves the perplexity even more (see sections 3.1, 4.3 and 4.4).

4. The LM improves the modeling of OOV words by exploiting their surface form (see section 4.7).
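To make the parameter reduction in contribution 2 concrete, here is a back-of-the-envelope count of the input embedding parameters. All sizes below are hypothetical and chosen only for illustration; the actual vocabularies and dimensions used in the experiments are given later in the paper.

```python
# Rough input-embedding parameter count; every number here is a hypothetical
# placeholder, not a value taken from the paper.
V_word, V_char = 50_000, 48          # word and character vocabulary sizes (assumed)
total_emb, n_chars, char_emb = 200, 3, 25

word_only = V_word * total_emb       # 50,000 * 200 = 10,000,000 parameters

# Character-word input with the same total embedding size: the word embedding
# shrinks to total_emb - n_chars * char_emb, and the character matrix is tiny.
char_word = V_word * (total_emb - n_chars * char_emb) + V_char * char_emb
#         = 50,000 * 125 + 48 * 25 = 6,251,200 parameters

# Without weight sharing across the n character positions there would be
# n_chars separate character matrices (n_chars * V_char * char_emb = 3,600),
# which is still negligible next to the word embedding matrix.
print(word_only, char_word)
```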
The remainder of this paper is structured as follows: first, we discuss related work (section 2); then the CW LSTM LM is described (section 3) and tested (section 4). Finally, we give an overview of the results and an outlook to future work (section 5).

2 Related work

Other work that investigates the use of character information in RNN LMs either completely replaces the word-level representation by a character-level one or combines word and character information. Much research has also been done on modeling other types of subword information (e.g. morphemes, syllables), but in this discussion, we limit ourselves to characters as subword information.

Research on replacing the word embeddings entirely has been done for neural machine translation (NMT) by Ling et al. (2015) and Costa-jussà and Fonollosa (2016), who replace word-level embeddings with character-level embeddings. Chung et al. (2016) use a subword-level encoder and a character-level decoder for NMT. In dependency parsing, Ballesteros et al. (2015) achieve improvements by generating character-level embeddings with a bidirectional LSTM. Xie et al. (2016) work on natural language correction and also use an encoder-decoder, but operate for both the encoder and the decoder on the character level.

Character-level word representations can also be generated with convolutional neural networks (CNNs), as Zhang et al. (2015) and Kim et al. (2016) have proven for text classification and language modeling respectively. Kim et al. (2016) achieve state-of-the-art results in language modeling for several languages by combining a character-level CNN with highway (Srivastava et al., 2015) and LSTM layers. However, the major improvement is achieved by adding the highway layers: for a small model size, the purely character-level model without highway layers does not perform better than the word-level model (perplexity of 100.3 compared to 97.6), even though the character model has two hidden layers of 300 LSTM units each and is compared to a word model of two hidden layers of only 200 units (in order to keep the number of parameters similar). For a model of larger size, the character-level LM improves on the word baseline (84.6 compared to 85.4), but the largest improvement is achieved by adding two highway layers (78.9). Finally, Jozefowicz et al. (2016) also describe character embeddings generated by a CNN, but they test on the 1B Word Benchmark, a data set of an entirely different scale than the one we use.

Other authors combine the word and character information (as we do in this paper) rather than doing away completely with word inputs. Chen et al. (2015) and Kang et al. (2011) work on models combining words and Chinese characters to learn embeddings. Note however that Chinese characters more closely match subwords or words than phonemes. Bojanowski et al. (2015) operate on the character level but use knowledge about the context words in two variants of character-level RNN LMs. Dos Santos and Zadrozny (2014) join word and character representations in a deep neural network for part-of-speech tagging. Finally, Miyamoto and Cho (2016) describe a LM that is related to our model, although their character-level embedding is generated by a bidirectional LSTM, whereas we do not use a gate to determine how much of the word and how much of the character embedding is used. However, they only compare to a simple baseline model of 2 LSTM layers of 200 hidden units each without dropout, resulting in a higher baseline perplexity (as mentioned in section 4.3, our CW model also achieves larger improvements than reported in this paper with respect to that baseline).

We can conclude that in various NLP tasks, characters have recently been introduced in several different manners. However, the models investigated in related work are either not tested on a competitive baseline (Miyamoto and Cho, 2016) or do not perform better than our models (Kim et al., 2016). In this paper, we introduce a new and straightforward manner to incorporate characters in a LM that (as far as we know) has not been investigated before.

[Figure 1: Concatenating word and character embeddings: the word embedding w_t of e.g. ‘cat’ is concatenated with the character embeddings c_t^1, c_t^2, c_t^3 and fed to the LSTM together with the previous hidden state h_{t-1}.]
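As a concrete illustration of the forward/backward character order discussed in the introduction (our sketch, not code from the paper), the helper below fills the n character slots of Figure 1 from either end of the word; the padding symbol for words shorter than n characters is an assumption made here for completeness.

```python
# Hypothetical helper (not from the paper): select the n character slots of
# Figure 1 in forward or backward order. The "<pad>" symbol for short words
# is an assumption of this sketch.
def char_slots(word: str, n: int = 3, order: str = "forward", pad: str = "<pad>"):
    chars = list(word) if order == "forward" else list(reversed(word))
    return chars[:n] + [pad] * max(0, n - len(chars))

# 'cat' with n = 3: forward -> ['c', 'a', 't'], backward -> ['t', 'a', 'c'].
# For 'overfitting' and 'underfitting', backward order puts the shared suffix
# in the same slots of both words:
print(char_slots("overfitting", order="backward"))   # ['g', 'n', 'i']
print(char_slots("underfitting", order="backward"))  # ['g', 'n', 'i']
```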