Convolutional Neural Network Language Models

Ngoc-Quan Pham and German Kruszewski and Gemma Boleda
Center for Mind/Brain Sciences, University of Trento
{firstname.lastname}@unitn.it

Abstract

Convolutional Neural Networks (CNNs) have been shown to yield very strong results in several Computer Vision tasks. Their application to language has received much less attention, and it has mainly focused on static classification tasks, such as sentence classification for Sentiment Analysis or relation extraction. In this work, we study the application of CNNs to language modeling, a dynamic, sequential prediction task that needs models to capture local as well as long-range dependency information. Our contribution is twofold. First, we show that CNNs achieve 11-26% better absolute performance than feed-forward neural language models, demonstrating their potential for language representation even in sequential tasks. As for recurrent models, our model outperforms RNNs but is below state-of-the-art LSTM models. Second, we gain some understanding of the behavior of the model, showing that CNNs in language act as feature detectors at a high level of abstraction, like in Computer Vision, and that the model can profitably use information from as far as 16 words before the target.

1 Introduction

Convolutional Neural Networks (CNNs) are the family of neural network models that feature a type of layer known as the convolutional layer. This layer can extract features by convolving a learnable filter (or kernel) along different positions of a vectorial input.

CNNs have been successfully applied in Computer Vision in many different tasks, including object recognition, scene parsing, and action recognition (Gu et al., 2015), but they have received less attention in NLP. They have been somewhat explored in static classification tasks where the model is provided with a full linguistic unit as input (e.g. a sentence) and classes are treated as independent of each other. Examples of this are sentence or document classification for tasks such as Sentiment Analysis or Topic Categorization (Kalchbrenner et al., 2014; Kim, 2014), sentence matching (Hu et al., 2014), and relation extraction (Nguyen and Grishman, 2015). However, their application to sequential prediction tasks, where the input is construed to be part of a sequence (for example, language modeling or POS tagging), has been rather limited (with exceptions, such as Collobert et al. (2011)). The main contribution of this paper is a systematic evaluation of CNNs in the context of a prominent sequential prediction task, namely, language modeling.

Statistical language models are a crucial component in many NLP applications, such as Automatic Speech Recognition, Machine Translation, and Information Retrieval. Here, we study the problem under the standard formulation of learning to predict the upcoming token given its previous context. One successful approach to this problem relies on counting the number of occurrences of n-grams while using smoothing and back-off techniques to estimate the probability of an upcoming word (Kneser and Ney, 1995). However, since each individual word is treated independently of the others, n-gram models fail to capture semantic relations between words.


In contrast, neural network language models (Bengio et al., 2006) learn to predict the upcoming word given the previous context while embedding the vocabulary in a continuous space that can represent the similarity structure between words. Both feed-forward (Schwenk, 2007) and recurrent neural networks (Mikolov et al., 2010) have been shown to outperform n-gram models in various setups (Mikolov et al., 2010; Hai Son et al., 2011). These two types of neural networks make different architectural decisions. Recurrent networks take one token at a time together with a hidden "memory" vector as input and produce a prediction and an updated hidden vector for the next time step. In contrast, feed-forward language models take as input the last n tokens, where n is a fixed window size, and use them jointly to predict the upcoming word.

In this paper we define and explore CNN-based language models and compare them with both feed-forward and recurrent neural networks. Our results show an 11-26% perplexity reduction of the CNN with respect to the feed-forward language model, comparable or higher performance compared to similarly-sized recurrent models, and lower performance with respect to larger, state-of-the-art recurrent language models (LSTMs as trained in Zaremba et al. (2014)).

Our second contribution is an analysis of the kind of information learned by the CNN, showing that the network learns to extract a combination of grammatical, semantic, and topical information from tokens all across the input window, even those that are the farthest from the target.

2 Related Work

Convolutional Neural Networks (CNNs) were originally designed to deal with hierarchical representation in Computer Vision (LeCun and Bengio, 1995). Deep convolutional networks have been successfully applied in image classification and understanding (Simonyan and Zisserman, 2014; He et al., 2015). In such systems the convolutional kernels learn to detect visual features at both local and more abstract levels.

In NLP, CNNs have been mainly applied to static classification tasks for discovering latent structures in text. Kim (2014) uses a CNN to tackle sentence classification, with competitive results. The same work also introduces kernels with varying window sizes to learn complementary features at different aggregation levels. Kalchbrenner et al. (2014) propose a convolutional architecture for sentence representation that vertically stacks multiple convolution layers, each of which can learn independent convolution kernels. CNNs with similar structures have also been applied to other classification tasks, such as semantic matching (Hu et al., 2014), relation extraction (Nguyen and Grishman, 2015), and information retrieval (Shen et al., 2014). In contrast, Collobert et al. (2011) explore a CNN architecture to solve various sequential and non-sequential NLP tasks such as part-of-speech tagging, named entity recognition and also language modeling. This is perhaps the work that is closest to ours in the existing literature. However, their model differs from ours in that it uses a max-pooling layer that picks the most activated feature across time, thus ignoring temporal information, whereas we explicitly avoid doing so. More importantly, the language models trained in that work are only evaluated through downstream tasks and through the quality of the learned word embeddings, but not on the sequence prediction task itself, as we do here.

Besides being applied to word-based sequences, convolutional layers have also been used to model sequences at the character level. Kim et al. (2015) propose a recurrent language model that replaces the word-indexed projection matrix with a convolution layer fed with the character sequence that constitutes each word, to find morphological patterns. The main difference between that work and ours is that we consider words as the smallest linguistic unit, and thus apply the convolutional layer at the word level.

Statistical language modeling, the task we tackle, differs from most of the tasks where CNNs have been applied before in multiple ways. First, the input typically consists of incomplete sequences of words rather than complete sentences. Second, as a classification problem, it features an extremely large number of classes (the words in a large vocabulary). Finally, temporal information, which can be safely discarded in many settings with little impact in performance, is critical here: An n-gram appearing close to the predicted word may be more informative, or yield different information, than the same n-gram appearing several tokens earlier.

3 Models

Our model is constructed by extending a feed-forward language model (FFLM) with convolutional layers. In what follows, we first explain the implementation of the base FFLM and then describe the CNN model that we study.

3.1 Baseline FFLM

Our baseline feed-forward language model (FFLM) is almost identical to the original model proposed by Bengio et al. (2006), with only slight changes to push its performance as high as we can, producing a very strong baseline. In particular, we extend it with highway layers and use Dropout as regularization. The model is illustrated in Figure 1 and works as follows. First, each word in the input n-gram is mapped to a low-dimensional vector (viz. embedding) through a shared lookup table. Next, these word vectors are concatenated and fed to a highway layer (Srivastava et al., 2015). Highway layers improve the gradient flow of the network by computing as output a convex combination between its input (called the carry) and a traditional non-linear transformation of it (called the transform). As a result, if there is a neuron whose gradient cannot flow through the transform component (e.g., because the activation is zero), it can still receive the back-propagation update signal through the carry gate. We empirically observed the usage of a single highway layer to significantly improve the performance of the model. Even though a systematic evaluation of this aspect is beyond the scope of the current paper, our empirical results demonstrate that the resulting model is a very competitive one (see Section 4).

Finally, a softmax layer computes the model prediction for the upcoming word. We use ReLU for all non-linear activations, and Dropout (Hinton et al., 2012) is applied between each hidden layer.

Figure 1: Overview of baseline FFLM.

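For concreteness, a minimal PyTorch-style sketch of such a highway layer is given below. This is our own illustration (the authors' implementation is in Torch7), and the module name HighwayLayer is ours; it only shows the carry/transform combination described above.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Minimal highway layer: output = gate * transform(x) + (1 - gate) * x."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # non-linear transformation H(x)
        self.gate = nn.Linear(dim, dim)        # transform gate T(x)

    def forward(self, x):
        h = torch.relu(self.transform(x))      # the "transform" component
        t = torch.sigmoid(self.gate(x))        # how much of the transform to let through
        return t * h + (1.0 - t) * x           # the rest is the "carry" (identity) path
```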
3.2 CNN and variants

The proposed CNN network is produced by injecting a convolutional layer right after the words in the input are projected to their embeddings (Figure 2). Rather than being concatenated into a long vector, the embeddings x_i ∈ R^k are concatenated transversally, producing a matrix x_{1:n} ∈ R^{n×k}, where n is the size of the input and k is the embedding size. This matrix is fed to a time-delayed layer, which convolves a sliding window of w input vectors centered on each word vector using a parameter matrix W ∈ R^{w×k}. Convolution is performed by taking the dot-product between the kernel matrix W and each sub-matrix x_{i−w/2:i+w/2}, resulting in a scalar value for each position i in the input context. This value represents how much the words encompassed by the window match the feature represented by the filter W. A ReLU activation function is applied subsequently so negative activations are discarded. This operation is repeated multiple times using various kernel matrices W, learning different features independently. We tie the number of learned kernels to be the same as the embedding dimensionality k, such that the output of this stage will be another matrix of dimensions n×k containing the activations for each kernel at each time step. The number of kernels was tied to the embedding size for two reasons, one practical, namely, to limit the hyper-parameter search, and one methodological, namely, to keep the network structure identical to that of the baseline feed-forward model.

Figure 2: Convolutional layer on top of the context matrix.

Next, we add a batch normalization stage immediately after the convolutional output, which facilitates learning by addressing the internal covariate shift problem and regularizing the learned representations (Ioffe and Szegedy, 2015).

Finally, this feature matrix is directly fed into a fully connected layer that can project the extracted features into a lower-dimensional representation. This is different from previous work, where a max-over-time pooling operation was used to find the most activated feature in the time series. Our choice is motivated by the fact that the max pooling operator loses the specific position where the feature was detected, which is important for word prediction. After this initial convolutional layer, the network proceeds identically to the FFNN by feeding the produced features into a highway layer, and then, to a softmax output.
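The following sketch shows one way to realize this stage in PyTorch. It is an illustrative re-implementation under our assumptions (for example, zero padding to keep n output positions), not the authors' released code; the names ConvContextEncoder and mapping_dim are ours.

```python
import torch
import torch.nn as nn

class ConvContextEncoder(nn.Module):
    """Convolve k kernels of width w over the n x k matrix of context word embeddings."""
    def __init__(self, k=256, w=3, n=16, mapping_dim=256):
        super().__init__()
        # in_channels = out_channels = k, so the output is again an n x k activation matrix
        self.conv = nn.Conv1d(in_channels=k, out_channels=k, kernel_size=w, padding=w // 2)
        self.bn = nn.BatchNorm1d(k)                   # batch normalization on the kernel activations
        self.mapping = nn.Linear(n * k, mapping_dim)  # fully connected projection to a lower dimension

    def forward(self, embeddings):
        # embeddings: (batch, n, k); Conv1d expects (batch, channels=k, time=n)
        x = embeddings.transpose(1, 2)
        x = torch.relu(self.conv(x))                  # one scalar per kernel and position; ReLU drops negatives
        x = self.bn(x)
        # flatten position-major, keeping every time step (no max pooling over time)
        return self.mapping(x.transpose(1, 2).flatten(1))
```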
This is our basic CNN architecture. We also experiment with three expansions to the basic model, as follows. First, we generalize the CNN by extending the shallow linear kernels with deeper multi-layer perceptrons, in what is called a MLP Convolution (MLPConv) structure (Lin et al., 2013). This allows the network to produce non-linear filters, and it has achieved state-of-the-art performance in object recognition while reducing the number of total layers compared to other mainstream networks. Concretely, we implement MLPConv networks by using another convolutional layer with a 1×1 kernel on top of the convolutional layer output. This results in an architecture that is exactly equivalent to sliding a one-hidden-layer MLP over the input. Notably, we do not include the global pooling layer in the original Network-in-Network structure (Lin et al., 2013).

Second, we explore stacking convolutional layers on top of each other (Multi-layer CNN or ML-CNN) to connect the local features into broader regional representations, as commonly done in computer vision. While this proved to be useful for sentence representation (Kalchbrenner et al., 2014), here we have found it to be rather harmful for language modeling, as shown in Section 4. It is important to note that, in ML-CNN experiments, we stack convolutions with the same kernel size and number of kernels on top of each other, which is to be distinguished from the MLPConv that refers to the deeper structure in each CNN layer mentioned above.

Finally, we consider combining features learned through different kernel sizes (COM), as depicted in Figure 3. For example, we can have a combination of kernels that learn filters over 3-grams with others that learn over 5-grams. This is achieved simply by applying in parallel two or more sets of kernels to the input and concatenating their respective outputs (Kim, 2014).

Figure 3: Combining kernels with different sizes. We concatenate the outputs of 2 convolutional blocks with kernel size of 5 and 3 respectively.
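To make the MLPConv and COM variants concrete, here is a hedged PyTorch sketch (again ours, not the released Torch7 code): the 1×1 convolution plays the role of the one-hidden-layer MLP slid over positions, and COM runs two such blocks with kernel sizes 3 and 5 in parallel and concatenates their outputs.

```python
import torch
import torch.nn as nn

def mlpconv_block(k, w):
    """A kernel of width w followed by a 1x1 convolution = a one-hidden-layer MLP slid over time."""
    return nn.Sequential(
        nn.Conv1d(k, k, kernel_size=w, padding=w // 2), nn.ReLU(),
        nn.Conv1d(k, k, kernel_size=1), nn.ReLU(),   # the MLPConv (Network-in-Network) layer
    )

class ComEncoder(nn.Module):
    """COM variant: run two MLPConv blocks (w = 3 and w = 5) in parallel and concatenate."""
    def __init__(self, k=256):
        super().__init__()
        self.block3 = mlpconv_block(k, 3)
        self.block5 = mlpconv_block(k, 5)

    def forward(self, x):                            # x: (batch, k, n) context matrix
        return torch.cat([self.block3(x), self.block5(x)], dim=1)   # (batch, 2k, n)
```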

4 Experiments

We evaluate our model on three English corpora of different sizes and genres, the first two of which have been used for language modeling evaluation before. The Penn Treebank contains one million words of newspaper text with 10K words in the vocabulary. We reuse the preprocessing and training/test/validation division from Mikolov et al. (2014). Europarl-NC is a 64-million word corpus that was developed for a Machine Translation shared task (Bojar et al., 2015), combining Europarl data (from parliamentary debates in the European Union) and News Commentary data. We preprocessed the corpus with tokenization and true-casing tools from the Moses toolkit (Koehn et al., 2007). The vocabulary is composed of words that occur at least 3 times in the training set and contains approximately 60K words. We use the validation and test set of the MT shared task. Finally, we took a subset of the ukWaC corpus, which was constructed by crawling UK websites (Baroni et al., 2009). The training subset contains 200 million words and the vocabulary consists of the 200K words that appear more than 5 times in the training subset. The validation and test sets are different subsets of the ukWaC corpus, both containing 120K words. We preprocessed the data similarly to what we did for Europarl-NC.

We train our models using Stochastic Gradient Descent (SGD), which is relatively simple to tune compared to other optimization methods that involve additional hyper-parameters (such as alpha in RMSprop) while still being fast and effective. SGD is commonly used in similar work (Devlin et al., 2014; Zaremba et al., 2014; Sukhbaatar et al., 2015). The learning rate is kept fixed during a single epoch, but we reduce it by a fixed proportion every time the validation perplexity increases by the end of the epoch. The values for learning rate, learning rate shrinking and mini-batch sizes as well as context size are fixed once and for all based on insights drawn from previous work (Hai Son et al., 2011; Sukhbaatar et al., 2015; Devlin et al., 2014) as well as experimentation with the Penn Treebank validation set.

Specifically, the learning rate is set to 0.05, with a mini-batch size of 128 (we do not take the average of the loss over the batch, and the training set is shuffled). We multiply the learning rate by 0.5 every time we shrink it and clip the gradients if their norm is larger than 12. The network parameters are initialized randomly on a range from -0.01 to 0.01 and the context size is set to 16. In Section 6 we show that this large context window is fully exploited.
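The schedule just described (learning rate 0.05 held fixed within an epoch, halved whenever validation perplexity increases, with gradient norms clipped at 12) can be outlined as follows. This is a schematic sketch under our assumptions, not the authors' training script; model, model_loss, evaluate_ppl and the batch iterators are placeholders.

```python
import torch

def train(model, model_loss, evaluate_ppl, train_batches, valid_batches, max_epochs=20):
    """SGD with the schedule described above: halve the learning rate when validation perplexity rises."""
    lr, best_val_ppl = 0.05, float("inf")
    for _ in range(max_epochs):
        for batch in train_batches:                       # shuffled mini-batches of size 128
            loss = model_loss(model, batch)               # summed (not averaged) over the batch
            model.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=12.0)
            with torch.no_grad():
                for p in model.parameters():              # plain SGD update
                    if p.grad is not None:
                        p -= lr * p.grad
        val_ppl = evaluate_ppl(model, valid_batches)
        if val_ppl > best_val_ppl:                        # learning-rate shrinking
            lr *= 0.5
        best_val_ppl = min(best_val_ppl, val_ppl)
    return model
```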
For the base FFNN and CNN we varied embedding sizes (and thus, number of kernels) k = 128, 256. For k = 128 we explore the simple CNN, incrementally adding MLPConv and COM variations (in that order) and, alternatively, using a ML-CNN. For k = 256, we only explore the former three alternatives (i.e. all but the ML-CNN). For the kernel size, we set it to w = 3 words for the simple CNN (out of options 3, 5, 7, 9), whereas for the COM variant we use w = 3 and 5, based on experimentation on PTB. However, we observed the models to be generally robust to this parameter. Dropout rates are tuned specifically for each combination of model and dataset based on the validation perplexity. We also add small dropout (p = 0.05–0.15) when we train the networks on the smaller corpus (Penn Treebank).

The experimental results for recurrent neural network language models, such as Recurrent Neural Networks (RNN) and Long-Short Term Memory models (LSTM), on the Penn Treebank are quoted from previous work; for Europarl-NC, we train our own models (we also report the performance of these in-house trained RNN and LSTM models on the Penn Treebank for reference). Specifically, we train LSTMs with embedding size k = 256 and number of layers L = 2 as well as k = 512 with L = 1, 2. We train one RNN with k = 512 and L = 2. To train these models, we use the published source code from Zaremba et al. (2014). Our own models are also implemented in Torch7 for easier comparison.¹ Finally, we selected the best performing convolutional and recurrent language models on Europarl-NC and the Baseline FFLM to be evaluated on the ukWaC corpus.

For all models trained on Europarl-NC and ukWaC, we speed up training by approximating the softmax with Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010), with the parameters being set following previous work (Chen et al., 2015). Concretely, for each predicted word, we sample 10 words from the unigram distribution, and the normalization factor is such that ln Z = 9.²

¹ Available at https://github.com/quanpn90/NCE_CNNLM.
² We also experimented with Hierarchical Softmax (Mikolov et al., 2011) and found out that the NCE gave better performance in terms of speed and perplexity.

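As an illustration of the NCE setup above (10 noise samples per predicted word, drawn from the unigram distribution, and a fixed normalization constant ln Z = 9), a minimal sketch of the per-batch objective could look as follows; the function and argument names are ours, and the model and noise scores are assumed to be computed elsewhere.

```python
import math
import torch
import torch.nn.functional as F

def nce_loss(target_scores, noise_scores, log_q_target, log_q_noise, num_noise=10, ln_z=9.0):
    """NCE with a fixed normalization constant ln Z (no softmax over the full vocabulary).

    target_scores: (batch,) unnormalized scores s(w) of the true next words
    noise_scores:  (batch, num_noise) scores of words sampled from the unigram noise distribution
    log_q_target / log_q_noise: log-probabilities of those words under the noise distribution
    """
    log_k = math.log(num_noise)
    # log P(D=1 | w) = log sigmoid(s(w) - ln Z - log k - log q(w))
    pos = F.logsigmoid(target_scores - ln_z - log_k - log_q_target)
    neg = F.logsigmoid(-(noise_scores - ln_z - log_k - log_q_noise))
    return -(pos.sum() + neg.sum())
```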
For comparison, we also implemented a simpler version of the FFNN without dropout and highway layers (Bengio et al., 2006). These networks have two hidden layers (Arisoy et al., 2012) with a size of 2 times the embedding size (k), thus having the same number of parameters as our baseline.

5 Results

Our experimental results are summarized in Table 1. First of all, we can see that, even though the FFNN gives a very competitive performance,³ the addition of convolutional layers clearly improves it even further. Concretely, we observe a solid 11-26% reduction of perplexity compared to the feed-forward network after using MLP Convolution, depending on the setup and corpus. CNN alone yields a sizable improvement (5-24%), while MLPConv, in line with our expectations, adds another approximately 2-5% reduction in perplexity. A final (smaller) improvement comes from combining kernels of size 3 and 5, which can be attributed to a more expressive model that can learn patterns of n-grams of different sizes. In contrast to the successful two variants above, the multi-layer CNN did not help in better capturing the regularities of text, but rather the opposite: the more convolutional layers were stacked, the worse the performance. This also stands in contrast to the tradition of convolutional networks in Computer Vision, where using very deep convolutional neural networks is key to having better models. Deep convolution for text representation is in contrast rather rare, and to our knowledge it has only been successfully applied to sentence representation (Kalchbrenner et al., 2014). We conjecture that the reason why deep CNNs may not be so effective for language could be the effect of the convolution on the data: The convolution output for an image is akin to a new, more abstract image, which yet again can be subject to new convolution operations, whereas the textual counterpart may no longer have the same properties, in the relevant aspects, as the original linguistic input.

Regarding the comparison with a stronger LSTM, our models can perform competitively under the same embedding dimension (e.g. see k = 256 or k = 512) on the first two datasets. However, the LSTM can be easily scaled using larger models, as shown in Zaremba et al. (2014), which gives the best known results to date. This is not an option for our model, which heavily overfits with large hidden layers (around 1000) even with very large dropout values. Furthermore, the experiments on the larger ukWaC corpus show an even clearer advantage for the LSTM, which seems to be more efficient at harnessing this volume of data, than in the case of the two smaller corpora.

To sum up, we have established that the results of our CNN model are well above those of simple feed-forward networks and recurrent neural networks. While they are below state-of-the-art LSTMs, they are able to perform competitively with them for small and moderate-size models. Scaling to larger sizes may be today the main roadblock for CNNs to reach the same performances as large LSTMs in language modeling.

³ In our experiments, increasing the number of fully connected layers of the FFNN is harmful. Two hidden layers with highway connections is the best setting we could find.

6 Model Analysis

In what follows, we obtain insights into the inner workings of the CNN by looking into the linguistic patterns that the kernels learn to extract and also studying the temporal information extracted by the network in relation to its prediction capacity.

Learned patterns To get some insight into the kind of patterns that each kernel is learning to detect, we fed trigrams from the validation set of the Penn Treebank to each of the kernels, and extracted the ones that most highly activated the kernel, similarly to what was done in Kalchbrenner et al. (2014). Some examples are shown in Figure 4. Since the word windows are made of embeddings, we can expect patterns with similar embeddings to have close activation outputs. This is borne out in the analysis: The kernels specialize in distinct features of the data, including more syntactic-semantic constructions (cf. the "comparative kernel" including as . . . as patterns, but also of more than) and more lexical or topical features (cf. the "ending-in-month-name" kernel). Even in the more lexicalized features, however, we see linguistic regularities at different levels being condensed in a single kernel: For instance, the "spokesman" kernel detects phrases consisting of an indefinite determiner, a company name (or the word company itself) and the word "spokesman". We hypothesize that the convolutional layer adds an "I identify one specific feature, but at a high level of abstraction" dimension to a feed-forward neural network, similarly to what has been observed in image classification (Krizhevsky et al., 2012).
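A sketch of this probing procedure (ours, with hypothetical names and an assumed kernel layout of shape (num_kernels, w, k)): embed every validation trigram, convolve it with each kernel, and keep the trigrams with the highest ReLU activation per kernel.

```python
import torch

def top_trigrams_per_kernel(kernels, embed, trigrams, top=5):
    """kernels: tensor (num_kernels, w=3, k) of learned filters; embed: an nn.Embedding;
    trigrams: LongTensor (m, 3) of word indices for every trigram in the validation set.

    Returns a (num_kernels, top) tensor with the indices of the trigrams that produce
    the highest (ReLU'd) activation for each kernel, i.e. the phrases it responds to most."""
    x = embed(trigrams)                                            # (m, 3, k) embedded word windows
    acts = torch.relu(torch.einsum("iwk,jwk->ij", x, kernels))     # activation of kernel j on trigram i
    return acts.topk(top, dim=0).indices.t()                       # top trigram indices per kernel
```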

Model                         k     w    | Penn Treebank     | Europarl-NC       | ukWaC
                                         | val   test   #p   | val   test   #p   | val   test   #p
FFNN (Bengio et al., 2006)    128   -    | 156   147    4.5  | -     -      -    | -     -      -
Baseline FFNN                 128   -    | 114   109    4.5  | -     -      -    | -     -      -
+CNN                          128   3    | 108   102    4.5  | -     -      -    | -     -      -
+MLPConv                      128   3    | 102   97     4.5  | -     -      -    | -     -      -
+MLPConv+COM                  128   3+5  | 96    92     8    | -     -      -    | -     -      -
+ML-CNN (2 layers)            128   3    | 113   108    8    | -     -      -    | -     -      -
+ML-CNN (4 layers)            128   3    | 130   124    8    | -     -      -    | -     -      -

FFNN (Bengio et al., 2006)    256   -    | 161   152    8.2  | -     -      -    | -     -      -
Baseline FFNN                 256   -    | 110   105    8.2  | 133   174    48   | 136   147    156
+CNN                          256   3    | 104   98     8.3  | 112   133    48   | -     -      -
+MLPConv                      256   3    | 97    93     8.3  | 107   128    48   | 108   116    156
+MLPConv+COM                  256   3+5  | 95    91     18   | 108   128    83   | -     -      -
+MLPConv+COM                  512   3+5  | 96    92     52   | -     -      -    | -     -      -

Model                         k     L    | Penn Treebank     | Europarl-NC       | ukWaC
                                         | val   test   #p   | val   test   #p   | val   test   #p
RNN (Mikolov et al., 2014)    300   1    | 133   129    6    | -     -      -    | -     -      -
LSTM (Mikolov et al., 2014)   300   1    | 120   115    6.3  | -     -      -    | -     -      -
LSTM (Zaremba et al., 2014)   1500  2    | 82    78     48   | -     -      -    | -     -      -
LSTM (trained in-house)       256   2    | 108   103    5.1  | 137   155    31   | -     -      -
LSTM (trained in-house)       512   1    | 123   118    12   | 133   149    62   | -     -      -
LSTM (trained in-house)       512   2    | 94    90     11   | 114   124    63   | 79    83     205
RNN (trained in-house)        512   2    | 129   121    10   | 152   173    61   | -     -      -

Table 1: Results on the Penn Treebank, Europarl-NC, and ukWaC. Figure of merit is perplexity (lower is better). Legend: k: embedding size (also number of kernels for the convolutional models and hidden layer size for the recurrent models); w: kernel size; val: results on validation data; test: results on test data; #p: number of parameters in millions; L: number of layers.

Kernel 1: no matter how | are afraid how | question is how | remaining are how | to say how
Kernel 2: as little as | of more than | as high as | as much as | as low as
Kernel 3: a merc spokesman | a company spokesman | a boeing spokesman | a fidelity spokesman | a quotron spokeswoman
Kernel 4: amr chairman robert | chief economist john | chicago investor william | exchange chairman john | texas billionaire robert
Kernel 5: would allow the | does allow the | still expect ford | warrant allows the | funds allow investors
Kernel 6: more evident among | a dispute among | bargain-hunting among | growing fear among | paintings listed among
Kernel 7: facilities will substantially | which would substantially | dean witter actually | we 'll probably | you should really
Kernel 8: have until nov. | operation since aug. | quarter ended sept. | terrible tuesday oct. | even before june

Figure 4: Some example phrases that have highest activations for 8 example kernels (one line per kernel), extracted from the validation set of the Penn Treebank. Model trained with 256 kernels for 256-dimensional word vectors.

Temporal information To the best of our knowledge, the longest context used in feed-forward language models is 10 tokens (Hai Son et al., 2012), where no significant change in terms of perplexity was observed for bigger context sizes, even though in that work only same-sentence contexts were considered. In our experiments, we use a larger context size of 16 while removing the sentence boundary limit (as commonly done in n-gram language models) such that the network can take into account the words in the previous sentences.

To analyze whether all this information was effectively used, we took our best model, the CNN+MLPConv+COM model with embedding size of 256 (fifth line of second block in Table 1), and we identified the weights in the model that map the convolutional output (of size n × k) to a lower dimensional vector (the "mapping" layer in Figure 2). Recall that the output of the convolutional layer is a matrix indexed by time step and kernel index containing the activation of the kernel when convolved with a window of text centered around the word at the given time step. Thus, output units of the above mentioned mapping predicate over an ensemble of kernel activations for each time step. We can identify the patterns that they learn to detect by extracting the time-kernel combinations for which they have positive weights (since we have ReLU activations, negative weights are equivalent to ignoring a feature). First, we asked ourselves whether these units tend to be more focused on the time steps closer to the target or not. To test this, we calculated the sum of the positive weights for each position in time using an average of the mappings that correspond to each output unit. The results are shown in Figure 5. As could be expected, positions that are close to the token to be predicted have many active units (local context is very informative; see positions 2-4). However, surprisingly, positions that are actually far from the target are also quite active. It seems like the CNN is putting quite a lot of effort on characterizing long-range dependencies.

Figure 5: The distribution of positive weights over context positions, where 1 is the position closest to the predicted word.

Next, we checked that the information extracted from the positions that are far in the past is actually used for prediction. To measure this, we artificially lesioned the network so it would only read the features from a given range of time steps (words in the context). To lesion the network we manually masked the weights of the mapping that focus on times outside of the target range by setting them to zero. We started using only the word closest to the final position and sequentially unmasked earlier positions until the full context was used again. The result of this experiment is presented in Figure 6, and it confirms our previous observation that positions that are the farthest away contribute to the predictions of the model. The perplexity drops dramatically as the first positions are unmasked, and then decreases more slowly, approximately in the form of a power law (f(x) ∝ x^(-0.9)). Even though the effect is smaller, the last few positions still contribute to the final perplexity.

Figure 6: Perplexity change over position, by incrementally revealing the Mapping's weights corresponding to each position.
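The lesioning step can be pictured with the sketch below (ours; mapping_weight stands for the weights of the fully connected "mapping" layer, assumed to be laid out as (output_dim, n*k) with positions as the slower-varying index, as in the encoder sketch above).

```python
import torch

def mask_mapping_positions(mapping_weight, n, k, visible_positions):
    """Zero out the mapping weights that read convolutional features of masked context positions.

    mapping_weight: (out_dim, n * k) weights of the fully connected 'mapping' layer
    visible_positions: positions (0 = farthest word, n - 1 = closest) left intact"""
    w = mapping_weight.clone().view(-1, n, k)         # (out_dim, position, kernel)
    mask = torch.zeros(n)
    mask[list(visible_positions)] = 1.0
    return (w * mask.view(1, n, 1)).view(-1, n * k)

# Example: reveal only the 4 positions closest to the predicted word (context size n = 16)
# lesioned = mask_mapping_positions(mapping.weight.data, n=16, k=256, visible_positions=range(12, 16))
```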

7 Conclusion

In this work, we have investigated the potential of Convolutional Neural Networks for one prominent NLP task, language modeling, a sequential prediction task. We incorporate a CNN layer on top of a strong feed-forward model enhanced with modern techniques like Highway Layers and Dropout. Our results show a solid 11-26% reduction in perplexity with respect to the feed-forward model across three corpora of different sizes and genres when the model uses MLP Convolution and combines kernels of different window sizes. However, even without these additions we show CNNs to effectively learn language patterns that allow them to significantly decrease the model perplexity.

In our view, this improvement responds to two key properties of CNNs, highlighted in the analysis. First, as we have shown, they are able to integrate information from larger context windows, using information from words that are as far as 16 positions away from the predicted word. Second, as we have qualitatively shown, the kernels learn to detect specific patterns at a high level of abstraction. This is analogous to the role of convolutions in Computer Vision. The analogy, however, has limits; for instance, a deeper model stacking convolution layers harms performance in language modeling, while it greatly helps in Computer Vision. We conjecture that this is due to the differences in the nature of visual vs. linguistic data. The convolution creates a sort of abstract image that still retains significant properties of images. When applied to language, it detects important textual features but distorts the input, such that it is not text anymore.

As for recurrent models, even if our model outperforms RNNs, it is well below state-of-the-art LSTMs. Since CNNs are quite different in nature, we believe that a fruitful line of future research could focus on integrating the convolutional layer into a recurrent structure for language modeling, as well as other sequential problems, perhaps capturing the best of both worlds.

Acknowledgments

We thank Marco Baroni and three anonymous reviewers for fruitful feedback. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655577 (LOVe); ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES) and the Erasmus Mundus Scholarship for Joint Master Programs. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used in our research.

References

Ebru Arisoy, Tara N. Sainath, Brian Kingsbury, and Bhuvana Ramabhadran. 2012. Deep neural network language models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 20–28. Association for Computational Linguistics.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Frédéric Morin, and Jean-Luc Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal, September. Association for Computational Linguistics.

Xie Chen, Xunying Liu, Mark J. F. Gales, and Philip C. Woodland. 2015. Recurrent neural network language model training with noise contrastive estimation for speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5411–5415. IEEE.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In ACL (1), pages 1370–1380. Citeseer.

Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, and Gang Wang. 2015. Recent advances in convolutional neural networks. CoRR, abs/1512.07108.

Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, volume 1, page 6.

Le Hai Son, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, and François Yvon. 2011. Structured output layer neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5524–5527. IEEE.

Le Hai Son, Alexandre Allauzen, and François Yvon. 2012. Measuring the influence of long range dependencies with neural network language models. In Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, pages 1–10. Association for Computational Linguistics.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pages 2042–2050.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 655–665.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1, pages 181–184. IEEE.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Yann LeCun and Yoshua Bengio. 1995. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361(10):1995.

Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH, volume 2, page 3.

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Honza Černocký, and Sanjeev Khudanpur. 2011. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5528–5531. IEEE.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc'Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of NAACL-HLT, pages 39–48.

Holger Schwenk. 2007. Continuous space language models. Computer Speech & Language, 21(3):492–518.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 101–110. ACM.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. CoRR, abs/1505.00387.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2431–2439.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
