Sentence Modeling with Gated Recursive Neural Network

Xinchi Chen, Xipeng Qiu∗, Chenxi Zhu, Shiyu Wu, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{xinchichen13, xpqiu, czhu13, syu13, xjhuang}@fudan.edu.cn

∗ Corresponding author.

Abstract

Recently, neural network based sentence modeling methods have achieved great progress. Among these methods, recursive neural networks (RecNNs) can effectively model the combination of the words in a sentence. However, RecNNs need a given external topological structure, like a syntactic tree. In this paper, we propose a gated recursive neural network (GRNN) to model sentences, which employs a full binary tree (FBT) structure to control the combinations in the recursive structure. By introducing two kinds of gates, our model can better model the complicated combinations of features. Experiments on three text classification datasets show the effectiveness of our model.

[Figure 1: Example of Gated Recursive Neural Networks (GRNNs) on the sentence "I cannot agree with you more". Left: a GRNN using a directed acyclic graph (DAG) structure. Right: a GRNN using a full binary tree (FBT) structure. The green, gray and white nodes illustrate positive, negative and neutral sentiments respectively.]

1 Introduction

Recently, neural network based sentence modeling approaches have received increasing attention for their ability to minimize the effort spent on feature engineering, such as the Neural Bag-of-Words (NBoW) model, the Recurrent Neural Network (RNN) (Mikolov et al., 2010), the Recursive Neural Network (RecNN) (Pollack, 1990; Socher et al., 2013b; Socher et al., 2012) and the Convolutional Neural Network (CNN) (Kalchbrenner et al., 2014; Hu et al., 2014).

Among these methods, recursive neural networks (RecNNs) have shown an excellent ability to model the word combinations in a sentence. However, RecNNs require a pre-defined topological structure, like a parse tree, to encode a sentence, which limits the scope of their application. Cho et al. (2014) proposed the gated recursive convolutional neural network (grConv), which utilizes a directed acyclic graph (DAG) structure instead of a parse tree to model sentences. However, the DAG structure is relatively complicated: the number of hidden neurons increases quadratically with the length of the sentence, so grConv cannot effectively deal with long sentences.

Inspired by grConv, we propose a gated recursive neural network (GRNN) for sentence modeling. Unlike grConv, we use the full binary tree (FBT) as the topological structure to recursively model the word combinations, as shown in Figure 1, so the number of hidden neurons increases only linearly with the length of the sentence. Another difference is that we introduce two kinds of gates, reset and update gates (Chung et al., 2014), to control the combinations in the recursive structure. With these two gating mechanisms, our model can better model the complicated combinations of features and capture the long-distance dependency interactions.

In our previous works, we investigated several different topological structures (tree and directed acyclic graph) to recursively model the semantic composition from the bottom layer to the top layer, and applied them to Chinese word segmentation (Chen et al., 2015a) and dependency parsing (Chen et al., 2015b). However, these structures are not suitable for modeling sentences.


[Figure 3: Our proposed gated recursive unit.]

[Figure 2: Architecture of Gated Recursive Neural Network (GRNN).]

In this paper, we adopt the full binary tree as the topological structure to reduce the model complexity.

Experiments on the Stanford Sentiment Treebank dataset (Socher et al., 2013b) and the TREC questions dataset (Li and Roth, 2002) show the effectiveness of our approach.

2 Gated Recursive Neural Network

2.1 Architecture

The recursive neural network (RecNN) needs a topological structure to model a sentence, such as a syntactic tree. In this paper, we use a full binary tree (FBT), as shown in Figure 2, to model the combinations of features for a given sentence.

In fact, the FBT structure can model the combinations of features by continuously mixing the information from the bottom layer to the top layer. Each neuron can be regarded as a complicated feature composition of its governed sub-sentence. When two children nodes are combined into their parent node, the combination information of the two children is also merged and preserved by the parent node. As shown in Figure 2, we append all-zero padding vectors after the last word of the sentence until the length reaches 2^{⌈log_2 n⌉}, where n is the length of the given sentence. For example, a 5-word sentence is padded to 2^{⌈log_2 5⌉} = 8 leaves.

Inspired by the success of the gate mechanism of Chung et al. (2014), we further propose a gated recursive neural network (GRNN) by introducing two kinds of gates, namely the "reset gate" and the "update gate". Specifically, there are two reset gates, r_L and r_R, which partially read the information from the left child and the right child respectively, and three update gates, z_N, z_L and z_R, which decide what to preserve when combining the children's information. Intuitively, these gates decide how to update and exploit the combination information.

In the case of text classification, for each given sentence x_i = w^{(i)}_{1:N(i)} and its corresponding class y_i, we first represent each word w^{(i)}_j by its corresponding embedding vector in R^d, where N(i) indicates the length of the i-th sentence and d is the dimensionality of the word embeddings. Then, the embeddings are fed into the first layer of the GRNN as inputs, whose outputs are recursively fed into upper layers until the network outputs a single fixed-length vector. Finally, we obtain the class distribution P(·|x_i; θ) for the given sentence x_i by a softmax transformation of u_i, where u_i is the top node of the network (a fixed-length vectorial representation):

    P(·|x_i; θ) = softmax(W_s × u_i + b_s),   (1)

where b_s ∈ R^{|T|} and W_s ∈ R^{|T|×d}; d is the dimensionality of the top node u_i, which is the same as the word embedding size, and T represents the set of possible classes. θ represents the parameter set.

2.2 Gated Recursive Unit

GRNN consists of minimal structures, the gated recursive units, as shown in Figure 3.

Assuming that the length of the sentence is n, the recursion has layers l ∈ [1, ⌈log_2 n⌉ + 1], where the symbol ⌈q⌉ indicates the minimal integer q∗ ≥ q. At each recursion layer l, the activation of the j-th hidden node h^{(l)}_j ∈ R^d, with j ∈ [0, 2^{⌈log_2 n⌉+1−l}), is computed as

    h^{(l)}_j = z_N ⊙ ĥ^l_j + z_L ⊙ h^{l−1}_{2j} + z_R ⊙ h^{l−1}_{2j+1},   if l > 1,
    h^{(l)}_j = the corresponding word embedding,                           if l = 1,   (2)

where z_N, z_L and z_R ∈ R^d are the update gates for the new activation ĥ^l_j, the left child node h^{l−1}_{2j} and the right child node h^{l−1}_{2j+1} respectively, and ⊙ indicates element-wise multiplication.
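To make this bottom-up recursion concrete, the following is a minimal NumPy sketch (not the authors' released code) of the padding step and the layer-wise forward pass of Eqs. (1)-(2); here gated_unit stands in for the gated recursive unit defined by Eqs. (3)-(6) below, and W_s, b_s are the softmax parameters of Eq. (1).

```python
import numpy as np

def pad_to_full_binary_tree(embeddings):
    """Append all-zero vectors so that the number of leaves becomes
    2**ceil(log2(n)) for a sentence of n >= 1 words (Section 2.1)."""
    n, d = embeddings.shape
    padded_len = 2 ** int(np.ceil(np.log2(n)))
    return np.vstack([embeddings, np.zeros((padded_len - n, d))])

def grnn_forward(embeddings, gated_unit, W_s, b_s):
    """Layer 1 holds the (padded) word embeddings; every upper layer
    halves the number of nodes by merging neighbouring pairs with the
    gated recursive unit (Eq. 2), until a single top node u_i is left.
    The class distribution then follows Eq. (1)."""
    layer = pad_to_full_binary_tree(embeddings)             # h^(1)
    while layer.shape[0] > 1:
        layer = np.vstack([gated_unit(layer[2 * j], layer[2 * j + 1])
                           for j in range(layer.shape[0] // 2)])
    u = layer[0]                                            # top node u_i
    scores = W_s @ u + b_s                                  # W_s: |T| x d
    exp_scores = np.exp(scores - scores.max())              # stabilised softmax
    return u, exp_scores / exp_scores.sum()                 # Eq. (1)
```

For a 5-word sentence this produces layers of 8, 4, 2 and 1 nodes, matching the ⌈log_2 n⌉ + 1 recursion layers described above.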

The update gates can be formalized as

    z = [z_N; z_L; z_R] = [1/Z; 1/Z; 1/Z] ⊙ exp(U [ĥ^l_j; h^{l−1}_{2j}; h^{l−1}_{2j+1}]),   (3)

where [a; b] denotes the vertical concatenation of vectors, U ∈ R^{3d×3d} is the coefficient matrix of the update gates, and Z ∈ R^d is the vector of normalization coefficients,

    Z_k = Σ_{i=1}^{3} [exp(U [ĥ^l_j; h^{l−1}_{2j}; h^{l−1}_{2j+1}])]_{d×(i−1)+k},   (4)

where 1 ≤ k ≤ d.

The new activation ĥ^l_j is computed as

    ĥ^l_j = tanh(W_ĥ [r_L ⊙ h^{l−1}_{2j}; r_R ⊙ h^{l−1}_{2j+1}]),   (5)

where W_ĥ ∈ R^{d×2d}, r_L ∈ R^d and r_R ∈ R^d. r_L and r_R are the reset gates for the left child node h^{l−1}_{2j} and the right child node h^{l−1}_{2j+1} respectively, which can be formalized as

    [r_L; r_R] = σ(G [h^{l−1}_{2j}; h^{l−1}_{2j+1}]),   (6)

where G ∈ R^{2d×2d} is the coefficient matrix of the two reset gates and σ indicates the sigmoid function.

Intuitively, the reset gates control how to select the output information of the left and right children, which results in the current new activation ĥ. Through the update gates, the activation of a parent neuron can be regarded as a choice among the current new activation ĥ, the left child and the right child. This choice allows the overall structure to change adaptively with respect to the inputs. This gate mechanism is effective for modeling the combinations of features.
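Read together, Eqs. (2)-(6) define one merge step. The sketch below is an illustrative NumPy implementation of a single gated recursive unit under the shapes stated above (G ∈ R^{2d×2d}, W_ĥ ∈ R^{d×2d}, U ∈ R^{3d×3d}); the function and variable names are ours, and the exp-and-normalize of Eqs. (3)-(4) is written out literally rather than via a library softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_recursive_unit(h_left, h_right, G, W_hat, U):
    """Merge two child vectors of dimension d into one parent vector
    of dimension d, following Eqs. (2)-(6)."""
    d = h_left.shape[0]
    children = np.concatenate([h_left, h_right])               # [h_L; h_R]

    # Reset gates, Eq. (6)
    r = sigmoid(G @ children)
    r_left, r_right = r[:d], r[d:]

    # Candidate activation, Eq. (5)
    h_hat = np.tanh(W_hat @ np.concatenate([r_left * h_left,
                                            r_right * h_right]))

    # Update gates, Eqs. (3)-(4): a coordinate-wise normalisation over
    # the three sources (candidate, left child, right child)
    e = np.exp(U @ np.concatenate([h_hat, h_left, h_right]))    # length 3d
    z = e.reshape(3, d) / e.reshape(3, d).sum(axis=0)           # rows: z_N, z_L, z_R
    z_new, z_left, z_right = z

    # Parent activation, Eq. (2), case l > 1
    return z_new * h_hat + z_left * h_left + z_right * h_right
```

A closure over parameters shared across all positions and layers (sharing is our assumption; the paper does not spell it out here), e.g. `gated_unit = lambda l, r: gated_recursive_unit(l, r, G, W_hat, U)`, is what the earlier forward-pass sketch would call.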
2.3 Training

We use the Maximum Likelihood (ML) criterion to train our model. Given the training set (x_i, y_i) and the parameter set θ of our model, the goal is to minimize the loss function

    J(θ) = −(1/m) Σ_{i=1}^{m} log P(y_i | x_i; θ) + (λ/2m) ‖θ‖²_2,   (7)

where m is the number of training sentences.

Following Socher et al. (2013a), we use the diagonal variant of AdaGrad (Duchi et al., 2011) with mini-batches to minimize the objective.

For parameter initialization, we use random initialization within (−0.01, 0.01) for all parameters except the word embeddings. We adopt the pre-trained English word embeddings from Collobert et al. (2011) and fine-tune them during training.
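As a small illustration of the objective in Eq. (7) and the diagonal AdaGrad step, consider the following sketch. It assumes the gradients have already been obtained by backpropagation through the network; the smoothing constant eps is a common implementation detail rather than something specified in the paper, and the λ and α values used in the paper are listed in Table 1 below.

```python
import numpy as np

def objective(log_probs_of_gold, params, lam):
    """Regularised negative log-likelihood of Eq. (7).
    log_probs_of_gold[i] = log P(y_i | x_i; theta) for the i-th of the
    m training sentences; params is a flat vector of all parameters."""
    m = len(log_probs_of_gold)
    return -np.mean(log_probs_of_gold) + (lam / (2.0 * m)) * np.sum(params ** 2)

def adagrad_step(params, grad, grad_sq_sum, alpha=0.3, eps=1e-6):
    """Diagonal AdaGrad update (Duchi et al., 2011): each coordinate is
    scaled by the square root of its accumulated squared gradient."""
    grad_sq_sum += grad ** 2
    params -= alpha * grad / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum
```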
3 Experiments

3.1 Datasets

To evaluate our approach, we test our model on three datasets:

• SST-1: movie reviews with five classes in the Stanford Sentiment Treebank (Socher et al., 2013b; http://nlp.stanford.edu/sentiment): negative, somewhat negative, neutral, somewhat positive, positive.

• SST-2: movie reviews with binary classes in the Stanford Sentiment Treebank (Socher et al., 2013b): negative, positive.

• QC: the TREC questions dataset (Li and Roth, 2002; http://cogcomp.cs.illinois.edu/Data/QA/QC/), which involves six different question types.

3.2 Hyper-parameters

Table 1 lists the hyper-parameters of our model. We also exploit the dropout strategy (Srivastava et al., 2014) to avoid overfitting. In addition, we set the batch size to 20. We set the word embedding size to d = 50 on the TREC dataset and d = 100 on the Stanford Sentiment Treebank datasets.

Initial learning rate          α = 0.3
Regularization                 λ = 10^{−4}
Dropout rate on input layer    p = 20%

Table 1: Hyper-parameter settings.

3.3 Experiment Results

Table 2 shows the performance of our GRNN on the three datasets.

Methods                                 SST-1   SST-2   QC
NBoW (Kalchbrenner et al., 2014)        42.4    80.5    88.2
PV (Le and Mikolov, 2014)               44.6∗   82.7∗   91.8∗
CNN-non-static (Kim, 2014)              48.0    87.2    93.6
CNN-multichannel (Kim, 2014)            47.4    88.1    92.2
MaxTDNN (Collobert and Weston, 2008)    37.4    77.1    84.4
DCNN (Kalchbrenner et al., 2014)        48.5    86.8    93.0
RecNTN (Socher et al., 2013b)           45.7    85.4    -
RAE (Socher et al., 2011)               43.2    82.4    -
MV-RecNN (Socher et al., 2012)          44.4    82.9    -
AdaSent (Zhao et al., 2015)             -       -       92.4
GRNN (our approach)                     47.5    85.5    93.8

Table 2: Performances of the different models. The results of PV (marked with ∗) are from our own implementation based on Gensim.

Competitor Models  The Neural Bag-of-Words (NBoW) model is a simple and intuitive method which ignores the word order. Paragraph Vector (PV) (Le and Mikolov, 2014) learns continuous distributed vector representations for pieces of text, which can be regarded as a long-term memory of sentences, as opposed to the short memory of a recurrent neural network; here, we use the popular open-source implementation of PV in Gensim (https://github.com/piskvorky/gensim/). The methods in the third block are CNN based models. Kim (2014) reports four different CNN models using max-over-time pooling, of which CNN-non-static and CNN-multichannel are the more sophisticated ones. The MaxTDNN sentence model is based on the architecture of the Time-Delay Neural Network (TDNN) (Waibel et al., 1989; Collobert and Weston, 2008). The dynamic convolutional neural network (DCNN) (Kalchbrenner et al., 2014) uses the dynamic k-max pooling operator as a non-linear sub-sampling function, in which the choice of k depends on the length of the given sentence. The methods in the fourth block are RecNN based models. The Recursive Neural Tensor Network (RecNTN) (Socher et al., 2013b) is an extension of the plain RecNN which also depends on an external syntactic structure. The Recursive Autoencoder (RAE) (Socher et al., 2011) learns representations of sentences by minimizing the reconstruction error. The Matrix-Vector Recursive Neural Network (MV-RecNN) (Socher et al., 2012) is an extension of RecNN that assigns a vector and a matrix to every node in the parse tree. AdaSent (Zhao et al., 2015) adopts a recursive neural network using a DAG structure.

Moreover, the plain GRNN, which does not incorporate the gate mechanism, cannot outperform the GRNN model. Theoretically, the plain GRNN can be regarded as a special case of GRNN whose parameters are constrained or truncated. As a result, GRNN is a more powerful model which outperforms the plain GRNN. Thus, we mainly focus on the GRNN model in this paper.

Result Discussion  Generally, our model is better than the previous recursive neural network based models (RecNTN, RAE, MV-RecNN and AdaSent), which indicates that our model can better model the combinations of features with the FBT and our gating mechanism, even without an external syntactic tree. Although we just use the top-layer outputs as the features for classification, our model still outperforms AdaSent. Compared with the CNN based methods (MaxTDNN, DCNN and the CNNs of Kim (2014)), our model achieves comparable performance with much fewer parameters: although the CNN based methods outperform our model on SST-1 and SST-2, the number of parameters of GRNN ranges from 40K to 160K (counting only the network parameters and leaving out the word embeddings), while a CNN has about 400K parameters.

4 Related Work

Cho et al. (2014) proposed grConv to model sentences for machine translation. Unlike our model, grConv uses a DAG as the topological structure to model sentences; the number of internal nodes is n²/2, where n is the length of the sentence.

Zhao et al. (2015) use the same structure to model sentences (called AdaSent), and utilize the information of the internal nodes for text classification. Unlike grConv and AdaSent, our model uses the full binary tree as the topological structure; the number of internal nodes is 2n in our model. Therefore, our model is more efficient for long sentences. In addition, we just use the top-layer neurons for text classification.

Moreover, grConv and AdaSent only exploit one gating mechanism (the update gate), which cannot sufficiently model the complicated feature combinations. Unlike them, our model incorporates two kinds of gates and can better model the feature combinations.

Hu et al. (2014) also proposed a similar architecture for matching problems, but they employed a convolutional neural network, which might be too coarse to model the feature combinations.

5 Conclusion

In this paper, we propose a gated recursive neural network (GRNN) to recursively summarize the meaning of a sentence. GRNN uses a full binary tree as the recursive topological structure instead of an external syntactic tree. In addition, we introduce two kinds of gates to model the complicated combinations of features. In future work, we would like to investigate other gating mechanisms for better modeling the feature combinations.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. This work was partially funded by the National Natural Science Foundation of China (61472088, 61473092), the National High Technology Research and Development Program of China (2015AA015408), and the Shanghai Science and Technology Development Funds (14ZR1403200).

References

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for Chinese word segmentation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Xinchi Chen, Yaqian Zhou, Chenxi Zhu, Xipeng Qiu, and Xuanjing Huang. 2015b. Transition-based dependency parsing using two heterogeneous gated recursive neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 556–562.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.

Jordan B Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1):77–105.

Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161.

Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211.

Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328–339.

Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. arXiv preprint arXiv:1504.05070.
