Sentence Modeling with Gated Recursive Neural Network

Xinchi Chen, Xipeng Qiu∗, Chenxi Zhu, Shiyu Wu, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{xinchichen13, xpqiu, czhu13, syu13, xjhuang}@fudan.edu.cn

∗ Corresponding author.

Abstract

Recently, neural network based sentence modeling methods have achieved great progress. Among these methods, recursive neural networks (RecNNs) can effectively model the combination of the words in a sentence. However, RecNNs need a given external topological structure, like a syntactic tree. In this paper, we propose a gated recursive neural network (GRNN) to model sentences, which employs a full binary tree (FBT) structure to control the combinations in the recursive structure. By introducing two kinds of gates, our model can better model the complicated combinations of features. Experiments on three text classification datasets show the effectiveness of our model.

[Figure 1: Example of Gated Recursive Neural Networks (GRNNs) on the sentence "I cannot agree with you more". Left: a GRNN using a directed acyclic graph (DAG) structure. Right: a GRNN using a full binary tree (FBT) structure. The green, gray and white nodes illustrate positive, negative and neutral sentiments respectively.]

1 Introduction

Recently, neural network based sentence modeling approaches have received increasing attention for their ability to minimize the effort spent on feature engineering, such as the Neural Bag-of-Words (NBoW) model, the Recurrent Neural Network (RNN) (Mikolov et al., 2010), the Recursive Neural Network (RecNN) (Pollack, 1990; Socher et al., 2013b; Socher et al., 2012) and the Convolutional Neural Network (CNN) (Kalchbrenner et al., 2014; Hu et al., 2014).

Among these methods, recursive neural networks (RecNNs) have shown an excellent ability to model the word combinations in a sentence. However, RecNNs require a pre-defined topological structure, like a parse tree, to encode a sentence, which limits the scope of their application. Cho et al. (2014) proposed the gated recursive convolutional neural network (grConv), which utilizes a directed acyclic graph (DAG) structure instead of a parse tree to model sentences. However, the DAG structure is relatively complicated: the number of hidden neurons increases quadratically with the length of the sentence, so grConv cannot effectively deal with long sentences.

Inspired by grConv, we propose a gated recursive neural network (GRNN) for sentence modeling. Unlike grConv, we use the full binary tree (FBT) as the topological structure to recursively model the word combinations, as shown in Figure 1, so the number of hidden neurons increases only linearly with the length of the sentence. Another difference is that we introduce two kinds of gates, reset and update gates (Chung et al., 2014), to control the combinations in the recursive structure. With these two gating mechanisms, our model can better model the complicated combinations of features and capture the long-distance dependency interactions.

In our previous works, we investigated several different topological structures (tree and directed acyclic graph) to recursively model the semantic composition from the bottom layer to the top layer, and applied them to Chinese word segmentation (Chen et al., 2015a) and dependency parsing (Chen et al., 2015b). However, these structures are not suitable for modeling sentences.


[Figure 3: Our proposed gated recursive unit.]

[Figure 2: Architecture of Gated Recursive Neural Network (GRNN).]

In this paper, we adopt the full binary tree as the topological structure to reduce the model complexity.

Experiments on the Stanford Sentiment Treebank dataset (Socher et al., 2013b) and the TREC questions dataset (Li and Roth, 2002) show the effectiveness of our approach.

2 Gated Recursive Neural Network

2.1 Architecture

The recursive neural network (RecNN) needs a topological structure to model a sentence, such as a syntactic tree. In this paper, we use a full binary tree (FBT), as shown in Figure 2, to model the combinations of features for a given sentence.

In fact, the FBT structure can model the combinations of features by continuously mixing the information from the bottom layer to the top layer. Each neuron can be regarded as a complicated feature composition of its governed sub-sentence. When two children nodes are combined into their parent node, the combination information of the two children is also merged and preserved by the parent node. As shown in Figure 2, we append all-zero padding vectors after the last word of the sentence until the length reaches 2^{⌈log_2 n⌉}, where n is the length of the given sentence. For example, a 5-word sentence is padded to 2^{⌈log_2 5⌉} = 8 leaves.

Inspired by the success of the gate mechanism of Chung et al. (2014), we further propose a gated recursive neural network (GRNN) by introducing two kinds of gates, namely the "reset gate" and the "update gate". Specifically, there are two reset gates, r_L and r_R, which partially read the information from the left child and the right child respectively, and three update gates, z_N, z_L and z_R, which decide what to preserve when combining the children's information. Intuitively, these gates decide how to update and exploit the combination information.

In the case of text classification, for each given sentence x_i = w^{(i)}_{1:N(i)} and its corresponding class y_i, we first represent each word w^{(i)}_j by its corresponding embedding vector in R^d, where N(i) indicates the length of the i-th sentence and d is the dimensionality of the word embeddings. Then, the embeddings are fed into the first layer of the GRNN as inputs, whose outputs are recursively fed into upper layers until the network outputs a single fixed-length vector. Finally, we obtain the class distribution P(·|x_i; θ) for the given sentence x_i by a softmax transformation of u_i, where u_i is the top node of the network (a fixed-length vectorial representation):

    P(·|x_i; θ) = softmax(W_s × u_i + b_s),   (1)

where b_s ∈ R^{|T|} and W_s ∈ R^{|T|×d}; d is the dimensionality of the top node u_i, which is the same as the word embedding size, and T represents the set of possible classes. θ represents the parameter set.

2.2 Gated Recursive Unit

GRNN consists of minimal structures, the gated recursive units, as shown in Figure 3.

Assuming that the length of the sentence is n, the recursion has layers l ∈ [1, ⌈log_2 n⌉ + 1], where the symbol ⌈q⌉ indicates the minimal integer q∗ ≥ q. At each recursion layer l, the activation of the j-th hidden node h^{(l)}_j ∈ R^d, with j ∈ [0, 2^{⌈log_2 n⌉+1−l}), is computed as

    h^{(l)}_j = z_N ⊙ ĥ^l_j + z_L ⊙ h^{l−1}_{2j} + z_R ⊙ h^{l−1}_{2j+1},   if l > 1,
    h^{(l)}_j = the corresponding word embedding,                           if l = 1,   (2)

where z_N, z_L and z_R ∈ R^d are the update gates for the new activation ĥ^l_j, the left child node h^{l−1}_{2j} and the right child node h^{l−1}_{2j+1} respectively, and ⊙ indicates element-wise multiplication.
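To make this bottom-up recursion concrete, the following is a minimal NumPy sketch (not the authors' released code) of the padding step and the layer-wise forward pass of Eqs. (1)-(2); here gated_unit stands in for the gated recursive unit defined by Eqs. (3)-(6) below, and W_s, b_s are the softmax parameters of Eq. (1).

```python
import numpy as np

def pad_to_full_binary_tree(embeddings):
    """Append all-zero vectors so that the number of leaves becomes
    2**ceil(log2(n)) for a sentence of n >= 1 words (Section 2.1)."""
    n, d = embeddings.shape
    padded_len = 2 ** int(np.ceil(np.log2(n)))
    return np.vstack([embeddings, np.zeros((padded_len - n, d))])

def grnn_forward(embeddings, gated_unit, W_s, b_s):
    """Layer 1 holds the (padded) word embeddings; every upper layer
    halves the number of nodes by merging neighbouring pairs with the
    gated recursive unit (Eq. 2), until a single top node u_i is left.
    The class distribution then follows Eq. (1)."""
    layer = pad_to_full_binary_tree(embeddings)             # h^(1)
    while layer.shape[0] > 1:
        layer = np.vstack([gated_unit(layer[2 * j], layer[2 * j + 1])
                           for j in range(layer.shape[0] // 2)])
    u = layer[0]                                            # top node u_i
    scores = W_s @ u + b_s                                  # W_s: |T| x d
    exp_scores = np.exp(scores - scores.max())              # stabilised softmax
    return u, exp_scores / exp_scores.sum()                 # Eq. (1)
```

For a 5-word sentence this produces layers of 8, 4, 2 and 1 nodes, matching the ⌈log_2 n⌉ + 1 recursion layers described above.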

The update gates can be formalized as

    z = [z_N; z_L; z_R] = [1/Z; 1/Z; 1/Z] ⊙ exp(U [ĥ^l_j; h^{l−1}_{2j}; h^{l−1}_{2j+1}]),   (3)

where [a; b] denotes the vertical concatenation of vectors, U ∈ R^{3d×3d} is the coefficient matrix of the update gates, and Z ∈ R^d is the vector of normalization coefficients,

    Z_k = Σ_{i=1}^{3} [exp(U [ĥ^l_j; h^{l−1}_{2j}; h^{l−1}_{2j+1}])]_{d×(i−1)+k},   (4)

where 1 ≤ k ≤ d.

The new activation ĥ^l_j is computed as

    ĥ^l_j = tanh(W_ĥ [r_L ⊙ h^{l−1}_{2j}; r_R ⊙ h^{l−1}_{2j+1}]),   (5)

where W_ĥ ∈ R^{d×2d}, r_L ∈ R^d and r_R ∈ R^d. r_L and r_R are the reset gates for the left child node h^{l−1}_{2j} and the right child node h^{l−1}_{2j+1} respectively, which can be formalized as

    [r_L; r_R] = σ(G [h^{l−1}_{2j}; h^{l−1}_{2j+1}]),   (6)

where G ∈ R^{2d×2d} is the coefficient matrix of the two reset gates and σ indicates the sigmoid function.

Intuitively, the reset gates control how to select the output information of the left and right children, which results in the current new activation ĥ. Through the update gates, the activation of a parent neuron can be regarded as a choice among the current new activation ĥ, the left child and the right child. This choice allows the overall structure to change adaptively with respect to the inputs. This gate mechanism is effective for modeling the combinations of features.
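Read together, Eqs. (2)-(6) define one merge step. The sketch below is an illustrative NumPy implementation of a single gated recursive unit under the shapes stated above (G ∈ R^{2d×2d}, W_ĥ ∈ R^{d×2d}, U ∈ R^{3d×3d}); the function and variable names are ours, and the exp-and-normalize of Eqs. (3)-(4) is written out literally rather than via a library softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_recursive_unit(h_left, h_right, G, W_hat, U):
    """Merge two child vectors of dimension d into one parent vector
    of dimension d, following Eqs. (2)-(6)."""
    d = h_left.shape[0]
    children = np.concatenate([h_left, h_right])               # [h_L; h_R]

    # Reset gates, Eq. (6)
    r = sigmoid(G @ children)
    r_left, r_right = r[:d], r[d:]

    # Candidate activation, Eq. (5)
    h_hat = np.tanh(W_hat @ np.concatenate([r_left * h_left,
                                            r_right * h_right]))

    # Update gates, Eqs. (3)-(4): a coordinate-wise normalisation over
    # the three sources (candidate, left child, right child)
    e = np.exp(U @ np.concatenate([h_hat, h_left, h_right]))    # length 3d
    z = e.reshape(3, d) / e.reshape(3, d).sum(axis=0)           # rows: z_N, z_L, z_R
    z_new, z_left, z_right = z

    # Parent activation, Eq. (2), case l > 1
    return z_new * h_hat + z_left * h_left + z_right * h_right
```

A closure over parameters shared across all positions and layers (sharing is our assumption; the paper does not spell it out here), e.g. `gated_unit = lambda l, r: gated_recursive_unit(l, r, G, W_hat, U)`, is what the earlier forward-pass sketch would call.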
2.3 Training

We use the Maximum Likelihood (ML) criterion to train our model. Given the training set (x_i, y_i) and the parameter set θ of our model, the goal is to minimize the loss function

    J(θ) = −(1/m) Σ_{i=1}^{m} log P(y_i | x_i; θ) + (λ/2m) ‖θ‖²_2,   (7)

where m is the number of training sentences.

Following Socher et al. (2013a), we use the diagonal variant of AdaGrad (Duchi et al., 2011) with mini-batches to minimize the objective.

For parameter initialization, we use random initialization within (−0.01, 0.01) for all parameters except the word embeddings. We adopt the pre-trained English word embeddings from Collobert et al. (2011) and fine-tune them during training.
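As a small illustration of the objective in Eq. (7) and the diagonal AdaGrad step, consider the following sketch. It assumes the gradients have already been obtained by backpropagation through the network; the smoothing constant eps is a common implementation detail rather than something specified in the paper, and the λ and α values used in the paper are listed in Table 1 below.

```python
import numpy as np

def objective(log_probs_of_gold, params, lam):
    """Regularised negative log-likelihood of Eq. (7).
    log_probs_of_gold[i] = log P(y_i | x_i; theta) for the i-th of the
    m training sentences; params is a flat vector of all parameters."""
    m = len(log_probs_of_gold)
    return -np.mean(log_probs_of_gold) + (lam / (2.0 * m)) * np.sum(params ** 2)

def adagrad_step(params, grad, grad_sq_sum, alpha=0.3, eps=1e-6):
    """Diagonal AdaGrad update (Duchi et al., 2011): each coordinate is
    scaled by the square root of its accumulated squared gradient."""
    grad_sq_sum += grad ** 2
    params -= alpha * grad / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum
```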
3 Experiments

3.1 Datasets

To evaluate our approach, we test our model on three datasets:

• SST-1: movie reviews with five classes in the Stanford Sentiment Treebank (Socher et al., 2013b; http://nlp.stanford.edu/sentiment): negative, somewhat negative, neutral, somewhat positive, positive.

• SST-2: movie reviews with binary classes in the Stanford Sentiment Treebank (Socher et al., 2013b): negative, positive.

• QC: the TREC questions dataset (Li and Roth, 2002; http://cogcomp.cs.illinois.edu/Data/QA/QC/), which involves six different question types.

3.2 Hyper-parameters

Table 1 lists the hyper-parameters of our model. We also exploit the dropout strategy (Srivastava et al., 2014) to avoid overfitting. In addition, we set the batch size to 20. We set the word embedding size to d = 50 on the TREC dataset and d = 100 on the Stanford Sentiment Treebank datasets.

Initial learning rate          α = 0.3
Regularization                 λ = 10^{−4}
Dropout rate on input layer    p = 20%

Table 1: Hyper-parameter settings.

3.3 Experiment Results

Table 2 shows the performance of our GRNN on the three datasets.

Methods                                 SST-1   SST-2   QC
NBoW (Kalchbrenner et al., 2014)        42.4    80.5    88.2
PV (Le and Mikolov, 2014)               44.6∗   82.7∗   91.8∗
CNN-non-static (Kim, 2014)              48.0    87.2    93.6
CNN-multichannel (Kim, 2014)            47.4    88.1    92.2
MaxTDNN (Collobert and Weston, 2008)    37.4    77.1    84.4
DCNN (Kalchbrenner et al., 2014)        48.5    86.8    93.0
RecNTN (Socher et al., 2013b)           45.7    85.4    -
RAE (Socher et al., 2011)               43.2    82.4    -
MV-RecNN (Socher et al., 2012)          44.4    82.9    -
AdaSent (Zhao et al., 2015)             -       -       92.4
GRNN (our approach)                     47.5    85.5    93.8

Table 2: Performances of the different models. The results of PV (marked with ∗) are from our own implementation based on Gensim.

Competitor Models  The Neural Bag-of-Words (NBoW) model is a simple and intuitive method which ignores the word order. Paragraph Vector (PV) (Le and Mikolov, 2014) learns continuous distributed vector representations for pieces of text, which can be regarded as a long-term memory of sentences, as opposed to the short memory of a recurrent neural network; here, we use the popular open-source implementation of PV in Gensim (https://github.com/piskvorky/gensim/). The methods in the third block are CNN based models. Kim (2014) reports four different CNN models using max-over-time pooling, of which CNN-non-static and CNN-multichannel are the more sophisticated ones. The MaxTDNN sentence model is based on the architecture of the Time-Delay Neural Network (TDNN) (Waibel et al., 1989; Collobert and Weston, 2008). The dynamic convolutional neural network (DCNN) (Kalchbrenner et al., 2014) uses the dynamic k-max pooling operator as a non-linear sub-sampling function, in which the choice of k depends on the length of the given sentence. The methods in the fourth block are RecNN based models. The Recursive Neural Tensor Network (RecNTN) (Socher et al., 2013b) is an extension of the plain RecNN which also depends on an external syntactic structure. The Recursive Autoencoder (RAE) (Socher et al., 2011) learns representations of sentences by minimizing the reconstruction error. The Matrix-Vector Recursive Neural Network (MV-RecNN) (Socher et al., 2012) is an extension of RecNN that assigns a vector and a matrix to every node in the parse tree. AdaSent (Zhao et al., 2015) adopts a recursive neural network using a DAG structure.

Moreover, the plain GRNN, which does not incorporate the gate mechanism, cannot outperform the GRNN model. Theoretically, the plain GRNN can be regarded as a special case of GRNN whose parameters are constrained or truncated. As a result, GRNN is a more powerful model which outperforms the plain GRNN. Thus, we mainly focus on the GRNN model in this paper.

Result Discussion  Generally, our model is better than the previous recursive neural network based models (RecNTN, RAE, MV-RecNN and AdaSent), which indicates that our model can better model the combinations of features with the FBT and our gating mechanism, even without an external syntactic tree. Although we just use the top-layer outputs as the features for classification, our model still outperforms AdaSent. Compared with the CNN based methods (MaxTDNN, DCNN and the CNNs of Kim (2014)), our model achieves comparable performance with much fewer parameters: although the CNN based methods outperform our model on SST-1 and SST-2, the number of parameters of GRNN ranges from 40K to 160K (counting only the network parameters and leaving out the word embeddings), while a CNN has about 400K parameters.

4 Related Work

Cho et al. (2014) proposed grConv to model sentences for machine translation. Unlike our model, grConv uses a DAG as the topological structure to model sentences; the number of internal nodes is n²/2, where n is the length of the sentence.

Zhao et al. (2015) use the same structure to model sentences (called AdaSent), and utilize the information of the internal nodes for text classification. Unlike grConv and AdaSent, our model uses the full binary tree as the topological structure; the number of internal nodes is 2n in our model. Therefore, our model is more efficient for long sentences. In addition, we just use the top-layer neurons for text classification.

Moreover, grConv and AdaSent only exploit one gating mechanism (the update gate), which cannot sufficiently model the complicated feature combinations. Unlike them, our model incorporates two kinds of gates and can better model the feature combinations.

Hu et al. (2014) also proposed a similar architecture for matching problems, but they employed a convolutional neural network, which might be too coarse to model the feature combinations.

5 Conclusion

In this paper, we propose a gated recursive neural network (GRNN) to recursively summarize the meaning of a sentence. GRNN uses a full binary tree as the recursive topological structure instead of an external syntactic tree. In addition, we introduce two kinds of gates to model the complicated combinations of features. In future work, we would like to investigate other gating mechanisms for better modeling the feature combinations.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. This work was partially funded by the National Natural Science Foundation of China (61472088, 61473092), the National High Technology Research and Development Program of China (2015AA015408), and the Shanghai Science and Technology Development Funds (14ZR1403200).

References

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for Chinese word segmentation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Xinchi Chen, Yaqian Zhou, Chenxi Zhu, Xipeng Qiu, and Xuanjing Huang. 2015b. Transition-based dependency parsing using two heterogeneous gated recursive neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of ACL.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, pages 556–562.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.

Jordan B Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1):77–105.

Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161.

Richard Socher, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211.

Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J Lang. 1989. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328–339.

Han Zhao, Zhengdong Lu, and Pascal Poupart. 2015. Self-adaptive hierarchical sentence model. arXiv preprint arXiv:1504.05070.
