Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Recurrent Neural Network for Text Classification with Multi-Task Learning

Pengfei Liu, Xipeng Qiu*, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{pfliu14, xpqiu, xjhuang}@fudan.edu.cn
*Corresponding author.

Abstract

Neural network based methods have made great progress on a variety of natural language processing tasks. However, in most previous work the models are learned with single-task supervised objectives, which often suffer from insufficient training data. In this paper, we use the multi-task learning framework to jointly learn across multiple related tasks. Based on recurrent neural networks, we propose three different mechanisms for sharing information to model text with task-specific and shared layers. The entire network is trained jointly on all these tasks. Experiments on four benchmark text classification tasks show that our proposed models can improve the performance of a task with the help of other related tasks.

1 Introduction

Distributed representations of words have been widely used in many natural language processing (NLP) tasks. Following this success, there has been rising interest in learning distributed representations for larger units of text, such as phrases, sentences, paragraphs and documents [Socher et al., 2013; Le and Mikolov, 2014; Kalchbrenner et al., 2014; Liu et al., 2015a]. The primary role of these models is to represent a variable-length sentence or document as a fixed-length vector. A good representation of variable-length text should fully capture the semantics of natural language.

Methods based on deep neural networks (DNN) usually need a large-scale corpus because of their large number of parameters; with limited data it is hard to train a network that generalizes well. However, building large-scale resources for some NLP tasks is extremely expensive. To deal with this problem, these models often involve an unsupervised pre-training phase, and the final model is then fine-tuned on a supervised training criterion with gradient-based optimization. Recent studies have demonstrated significant accuracy gains in several NLP tasks [Collobert et al., 2011] with the help of word representations learned from large unannotated corpora. Most pre-training methods are based on unsupervised objectives such as word prediction [Collobert et al., 2011; Turian et al., 2010; Mikolov et al., 2013]. This unsupervised pre-training is effective in improving the final performance, but it does not directly optimize the desired task.

Multi-task learning utilizes the correlation between related tasks to improve classification by learning the tasks in parallel. Motivated by the success of multi-task learning [Caruana, 1997], several neural network based NLP models [Collobert and Weston, 2008; Liu et al., 2015b] utilize multi-task learning to jointly learn several tasks with the aim of mutual benefit. The basic multi-task architecture of these models is to share some lower layers to determine common features; after the shared layers, the remaining layers are split into the multiple specific tasks.

In this paper, we propose three different models for sharing information with recurrent neural networks (RNN). All the related tasks are integrated into a single system which is trained jointly. The first model uses just one shared layer for all the tasks. The second model uses different layers for different tasks, but each layer can read information from the other layers. The third model not only assigns one specific layer to each task, but also builds a shared layer for all the tasks. In addition, we introduce a gating mechanism that enables the model to selectively utilize the shared information. The entire network is trained jointly on all these tasks.

Experimental results on four text classification tasks show that jointly learning multiple related tasks improves the performance of each task relative to learning them separately.

Our contributions are two-fold:

• First, we propose three multi-task architectures for RNN. Although the idea of multi-task learning is not new, our work is novel in integrating RNN into the multi-task learning framework, which learns to map arbitrary text into semantic vector representations with both task-specific and shared layers.

• Second, we demonstrate strong results on several text classification tasks. Our multi-task models outperform most of the state-of-the-art baselines.
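To make the shared-layer setup described above concrete, here is a minimal sketch (in PyTorch) of a multi-task text classifier in the spirit of the first model: one recurrent layer shared by all tasks, with a separate softmax output layer per task. The class name, the choice of an LSTM encoder and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SharedLayerMultiTaskRNN(nn.Module):
    """Illustrative sketch: one recurrent layer shared across tasks,
    plus one task-specific output layer per task."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, classes_per_task):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                # lookup layer
        self.shared = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # shared layer
        self.heads = nn.ModuleList(                                     # task-specific layers
            [nn.Linear(hidden_dim, c) for c in classes_per_task]
        )

    def forward(self, token_ids, task_id):
        # token_ids: (batch, seq_len) word indices from one task's mini-batch
        emb = self.embed(token_ids)
        _, (h_T, _) = self.shared(emb)       # final hidden state summarizes the text
        return self.heads[task_id](h_T[-1])  # logits for the given task

# Usage (hypothetical sizes): two tasks with 2 and 5 classes respectively.
# model = SharedLayerMultiTaskRNN(vocab_size=10000, embed_dim=100,
#                                 hidden_dim=100, classes_per_task=[2, 5])
# logits = model(token_ids, task_id=0)
```

A softmax over the returned logits gives the class distribution for the selected task, while the embedding and recurrent layers receive gradients from every task.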
2 Recurrent Neural Network for Specific-Task Text Classification

The primary role of the neural models is to represent the variable-length text as a fixed-length vector. These models generally consist of a projection layer that maps words, sub-word units or n-grams to vector representations (often trained beforehand with unsupervised methods), and then combine them with different architectures of neural networks.

There are several kinds of models for modeling text, such as the Neural Bag-of-Words (NBOW) model, recurrent neural networks (RNN) [Chung et al., 2014], recursive neural networks (RecNN) [Socher et al., 2012; 2013] and convolutional neural networks (CNN) [Collobert et al., 2011; Kalchbrenner et al., 2014]. These models take as input the embeddings of the words in the text sequence, and summarize its meaning in a fixed-length vector representation.

Among them, recurrent neural networks (RNN) are one of the most popular architectures used in NLP problems, because their recurrent structure is well suited to processing variable-length text.

Figure 1: Recurrent Neural Network for Classification (an RNN unfolded over the inputs x_1, ..., x_T, with hidden states h_1, ..., h_T and a softmax layer on the final state producing the prediction y).

2.1 Recurrent Neural Network

A recurrent neural network (RNN) [Elman, 1990] is able to process a sequence of arbitrary length by recursively applying a transition function to its internal hidden state vector h_t of the input sequence. The activation of the hidden state h_t at time-step t is computed as a function f of the current input symbol x_t and the previous hidden state h_{t-1}:

h_t = \begin{cases} 0, & t = 0 \\ f(h_{t-1}, x_t), & \text{otherwise} \end{cases}    (1)

It is common to use as the state-to-state transition function f the composition of an element-wise nonlinearity with an affine transformation of both x_t and h_{t-1}.

Traditionally, a simple strategy for modeling a sequence is to map the input sequence to a fixed-sized vector using one RNN, and then to feed the vector to a softmax layer for classification or other tasks [Cho et al., 2014].

Unfortunately, a problem with RNNs whose transition functions have this form is that, during training, components of the gradient vector can grow or decay exponentially over long sequences [Hochreiter et al., 2001; Hochreiter and Schmidhuber, 1997]. This problem of exploding or vanishing gradients makes it difficult for the RNN model to learn long-distance correlations in a sequence.

The long short-term memory network (LSTM) was proposed by Hochreiter and Schmidhuber [1997] to specifically address this issue of learning long-term dependencies. The LSTM maintains a separate memory cell inside it that updates and exposes its content only when deemed necessary. A number of minor modifications to the standard LSTM unit have been made; while there are numerous LSTM variants, here we describe the implementation used by Graves [2013].

We define the LSTM units at each time step t to be a collection of vectors in R^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t and a hidden state h_t. The LSTM transition equations are the following:

i_t = \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1}),    (2)
f_t = \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1}),    (3)
o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t),    (4)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1}),    (5)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,    (6)
h_t = o_t \odot \tanh(c_t),    (7)

where x_t is the input at the current time step, \sigma denotes the logistic sigmoid function and \odot denotes element-wise multiplication. Intuitively, the forget gate controls the amount by which each unit of the memory cell is erased, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state.
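As a concrete reading of Equations (2)-(7), the following NumPy snippet computes a single LSTM step. The weight shapes, the random initialization and the toy input sequence are illustrative assumptions; the matrices V_i, V_f and V_o are written as full d×d matrices, matching the form of the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Equations (2)-(7); p holds the weights.
    Note that o_t (Eq. 4) depends on the *current* cell c_t, so Eqs. (5)-(6)
    are evaluated first."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["V_i"] @ c_prev)  # Eq. (2)
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["V_f"] @ c_prev)  # Eq. (3)
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)                  # Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde                                     # Eq. (6)
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["V_o"] @ c_t)     # Eq. (4)
    h_t = o_t * np.tanh(c_t)                                               # Eq. (7)
    return h_t, c_t

# Illustrative setup: d hidden units, m-dimensional word embeddings.
d, m = 4, 3
rng = np.random.default_rng(0)
p = {name: 0.1 * rng.standard_normal((d, m) if name.startswith("W") else (d, d))
     for name in ["W_i", "U_i", "V_i", "W_f", "U_f", "V_f",
                  "W_o", "U_o", "V_o", "W_c", "U_c"]}
h, c = np.zeros(d), np.zeros(d)
for x_t in rng.standard_normal((5, m)):  # a toy sequence of five embeddings
    h, c = lstm_step(x_t, h, c, p)       # h after the last step plays the role of h_T
```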
2.2 Task-Specific Output Layer

In a single specific task, a simple strategy is to map the input sequence to a fixed-sized vector using one RNN, and then to feed the vector to a softmax layer for classification or other tasks.

Given a text sequence x = {x_1, x_2, ..., x_T}, we first use a lookup layer to get the vector representation (embedding) x_i of each word x_i. The output at the last moment, h_T, can be regarded as the representation of the whole sequence; it is followed by a fully connected layer and a softmax non-linear layer that predicts the probability distribution over classes. Figure 1 shows the unfolded RNN structure for text classification.

The parameters of the network are trained to minimise the cross-entropy of the predicted and true distributions:

L(\hat{y}, y) = - \sum_{i=1}^{N} \sum_{j=1}^{C} y_i^j \log(\hat{y}_i^j),    (8)

where y_i^j is the ground-truth label, \hat{y}_i^j is the predicted probability, N denotes the number of training samples and C is the number of classes.

3 Three Sharing Models for RNN-based Multi-Task Learning

Most existing neural network methods are based on supervised training objectives on a single task [Collobert et al., 2011; Socher et al., 2013; Kalchbrenner et al., 2014]. These methods often suffer from limited amounts of training data. To deal with this problem, these models often involve
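As stated in the abstract and introduction, all related tasks are integrated into a single system that is trained jointly, with the cross-entropy of Eq. (8) as the per-task loss. The sketch below shows one plausible way to organize such joint training in PyTorch, alternating mini-batches across tasks so that every task's gradient reaches the shared layers; the two-task setup, layer sizes, optimizer and alternating schedule are illustrative assumptions rather than the authors' training procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: two related classification tasks sharing one encoder.
vocab_size, embed_dim, hidden_dim = 10000, 100, 100
classes_per_task = [2, 5]

embed = nn.Embedding(vocab_size, embed_dim)                 # shared lookup layer
encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # shared recurrent layer
heads = nn.ModuleList([nn.Linear(hidden_dim, c) for c in classes_per_task])
params = (list(embed.parameters()) + list(encoder.parameters())
          + list(heads.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)               # optimizer choice is assumed

def joint_step(task_id, token_ids, labels):
    """One update on a mini-batch from a single task; the loss is the
    cross-entropy of Eq. (8) restricted to that mini-batch."""
    _, (h_T, _) = encoder(embed(token_ids))
    loss = F.cross_entropy(heads[task_id](h_T[-1]), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Assumed joint-training schedule: alternate mini-batches across all tasks,
# so gradients from every task update the shared embedding and encoder.
# for epoch in range(num_epochs):                                  # num_epochs: hypothetical
#     for task_id, (token_ids, labels) in iterate_task_batches():  # hypothetical loader
#         joint_step(task_id, token_ids, labels)
```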