Gated Recursive Neural Network for Chinese Word Segmentation

Xinchi Chen, Xipeng Qiu∗, Chenxi Zhu, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{xinchichen13,xpqiu,czhu13,xjhuang}@fudan.edu.cn

∗ Corresponding author.

Abstract

Recently, neural network models for natural language processing tasks have received increasing attention for their ability to alleviate the burden of manual feature engineering. However, the previous neural models cannot extract the complicated feature compositions that the traditional methods with discrete features can. In this paper, we propose a gated recursive neural network (GRNN) for Chinese word segmentation, which contains reset and update gates to incorporate the complicated combinations of the context characters. Since GRNN is relatively deep, we also use a supervised layer-wise training method to avoid the problem of gradient diffusion. Experiments on the benchmark datasets show that our model outperforms the previous neural network models as well as the state-of-the-art methods.

Figure 1: Illustration of our model for Chinese word segmentation on the sentence "下雨天地面积水" (Rainy Day / Ground / Accumulated water). The solid nodes indicate the active neurons, while the hollow ones indicate the suppressed neurons. The links denote the information flow: solid edges denote the acceptance of a combination, while dashed edges denote its rejection. As shown on the right of the figure, we obtain a score vector over the tags {B, M, E, S} for the target character "地" by incorporating all the combination information.

1 Introduction

Unlike English and other western languages, Chinese does not delimit words by white-space. Therefore, word segmentation is a preliminary and important pre-processing step for Chinese language processing. Most previous systems address this problem by treating it as a sequence labeling task and have achieved great success. Due to the nature of supervised learning, the performance of these models is greatly affected by the design of features. These features are explicitly represented by different combinations of the context characters, which are based on linguistic intuition and statistical information. However, the number of features can be so large that the resulting models are too large to use in practice and prone to overfitting on the training corpus.

Recently, neural network models have received increasing attention for their ability to minimize the effort of feature engineering. Collobert et al. (2011) developed a general neural network architecture for sequence labeling tasks. Following this work, many methods (Zheng et al., 2013; Pei et al., 2014; Qi et al., 2014) applied neural networks to Chinese word segmentation and achieved performance approaching the state-of-the-art methods.

However, these neural models simply concatenate the embeddings of the context characters and feed them into the neural network. Since the concatenation operation is relatively simple, it is difficult to model the complicated features that the traditional discrete feature based models capture. Although the complicated interactions of the inputs could in principle be modeled by a deep neural network, previous work shows that such a deep model does not outperform the one with a single non-linear layer.

Therefore, the previous neural models only capture interactions through a simple transition matrix and a single non-linear transformation. The dense features extracted via these simple interactions are not nearly as good as the substantial discrete features in the traditional methods.

In this paper, we propose a gated recursive neural network (GRNN) to model the complicated combinations of characters, and apply it to the Chinese word segmentation task. Inspired by the success of the gated recurrent neural network (Chung et al., 2014), we introduce two kinds of gates to control the combinations in the recursive structure. We also use a layer-wise training method to avoid the problem of gradient diffusion, and the dropout strategy to avoid overfitting.

Figure 1 gives an illustration of how our approach models the complicated combinations of the context characters. Given the sentence "雨 (Rainy) 天 (Day) 地面 (Ground) 积水 (Accumulated water)", the target character is "地". This sentence is difficult because every two consecutive characters could be combined into a word. To predict the label of the target character "地" in the given context, GRNN detects the combinations recursively from the bottom layer to the top. Then, we obtain a score vector over the tags by incorporating all the combination information in the network.

The contributions of this paper can be summarized as follows:

• We propose a novel GRNN architecture to model the complicated combinations of the context characters. GRNN can select and preserve the useful combinations via reset and update gates. These combinations play a role similar to that of the feature engineering in the traditional methods with discrete features.

• We evaluate the performance of Chinese word segmentation on the PKU, MSRA and CTB6 benchmark datasets, which are commonly used for the evaluation of Chinese word segmentation. Experimental results show that our model outperforms other neural network models and achieves state-of-the-art performance.

2 Neural Model for Chinese Word Segmentation

The Chinese word segmentation task is usually regarded as a character-based sequence labeling problem. Each character is labeled as one of {B, M, E, S} to indicate the segmentation: {B, M, E} represent the Begin, Middle and End of a multi-character segment respectively, and S represents a single-character segment.

The general neural network architecture for the Chinese word segmentation task is usually characterized by three specialized layers: (1) a character embedding layer; (2) a series of classical neural network layers; and (3) a tag inference layer. An illustration is shown in Figure 2.

Figure 2: General architecture of the neural model for Chinese word segmentation (input window of characters c_{i−2}, …, c_{i+2}, lookup table, concatenation, linear layer W_1×□+b_1, sigmoid layer g(□), linear layer W_2×□+b_2, and tag inference with transition scores A_{ij} over the per-position scores f(t|1), …, f(t|n)).

The most common tagging approach is based on a local window. The window approach assumes that the tag of a character largely depends on its neighboring characters.
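To make the tagging scheme and the local window concrete, here is a minimal sketch (hypothetical helper code, not from the paper) that converts a segmented sentence into character-level {B, M, E, S} tags and extracts a fixed-size context window around each character, padding at the sentence boundaries with the special symbols described in the next paragraphs. The segmentation used in the example follows the one given in the introduction.

```python
# Minimal sketch (not the authors' code): BMES tagging and context-window
# extraction for character-based Chinese word segmentation.

def words_to_bmes(words):
    """Convert a segmented sentence (list of words) to per-character BMES tags."""
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return chars, tags

def context_window(chars, i, w1=2, w2=2):
    """Return the (w1 + 1 + w2)-character window around position i,
    padded with the special symbols "start"/"end" at the boundaries."""
    padded = ["start"] * w1 + list(chars) + ["end"] * w2
    return padded[i:i + w1 + 1 + w2]

chars, tags = words_to_bmes(["雨", "天", "地面", "积水"])
print(list(zip(chars, tags)))   # [('雨','S'), ('天','S'), ('地','B'), ('面','E'), ('积','B'), ('水','E')]
print(context_window(chars, 2)) # five-character window centered on '地'
```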

Firstly, we have a character set C of size |C|. Each character c ∈ C is mapped into a d-dimensional embedding space as c ∈ R^d by a lookup table M ∈ R^{d×|C|}.

For each character c_i in a given sentence c_{1:n}, the context characters c_{i−w_1:i+w_2} are mapped to their corresponding character embeddings, where w_1 and w_2 are the left and right context lengths respectively. Specifically, the unknown characters and the characters exceeding the sentence boundaries are mapped to the special symbols "unknown", "start" and "end" respectively. In addition, w_1 and w_2 satisfy the constraint w_1 + w_2 + 1 = w, where w is the window size of the model. In the illustration in Figure 2, w_1, w_2 and w are set to 2, 2 and 5 respectively.

The embeddings of all the context characters are then concatenated into a single vector a_i ∈ R^{H_1} as the input of the neural network, where H_1 = w × d is the size of Layer 1. Then a_i is fed into a conventional neural network layer, which performs a linear transformation followed by an element-wise activation function g, such as tanh:

h_i = g(W_1 a_i + b_1),    (1)

where W_1 ∈ R^{H_2×H_1}, b_1 ∈ R^{H_2}, h_i ∈ R^{H_2}, and H_2 is the number of hidden units in Layer 2. Here, w, H_1 and H_2 are hyper-parameters chosen on the development set.

Then, a similar linear transformation is performed, this time without a following non-linear function:

f(t | c_{i−w_1:i+w_2}) = W_2 h_i + b_2,    (2)

where W_2 ∈ R^{|T|×H_2}, b_2 ∈ R^{|T|} and T is the set of the 4 possible tags. Each dimension of the vector f(t | c_{i−w_1:i+w_2}) ∈ R^{|T|} is the score of the corresponding tag.

To model the tag dependency, a transition score A_{ij} is introduced to measure the probability of jumping from tag i ∈ T to tag j ∈ T (Collobert et al., 2011).
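To make this baseline concrete, the following NumPy sketch implements the window-based scorer of Eqs. (1)–(2). The sizes and the random initialization are illustrative assumptions, not the authors' configuration or code.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's exact configuration).
d, w, H2, T = 50, 5, 100, 4          # embedding size, window, hidden units, number of tags
vocab = 5000
H1 = w * d

rng = np.random.default_rng(0)
M  = rng.normal(scale=0.1, size=(d, vocab))   # lookup table, one column per character
W1 = rng.normal(scale=0.1, size=(H2, H1)); b1 = np.zeros(H2)
W2 = rng.normal(scale=0.1, size=(T, H2));  b2 = np.zeros(T)

def window_scores(char_ids):
    """Tag scores f(t | c_{i-w1:i+w2}) for a window of w character ids (Eqs. 1-2)."""
    a = M[:, char_ids].T.reshape(-1)          # concatenate the w embeddings -> a_i in R^{H1}
    h = np.tanh(W1 @ a + b1)                  # Eq. (1): linear layer + element-wise tanh
    return W2 @ h + b2                        # Eq. (2): linear layer, one score per tag

print(window_scores([17, 256, 3, 88, 901]))   # 4 scores, one for each of {B, M, E, S}
```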

Figure 3: Architecture of the gated recursive neural network for Chinese word segmentation (input characters c_{i−2}, …, c_{i+2}, recursive layers, concatenation of all node outputs, and a linear scoring layer y_i = W_s × x_i + b_s over the tags).

3 Gated Recursive Neural Network for Chinese Word Segmentation

To model the complicated feature combinations, we propose a novel gated recursive neural network (GRNN) architecture for the Chinese word segmentation task (see Figure 3).

3.1 Recursive Neural Network

A recursive neural network (RNN) is a kind of deep neural network created by applying the same set of weights recursively over a given structure (such as a parsing tree) in topological order (Pollack, 1990; Socher et al., 2013a).

In the simplest case, children nodes are combined into their parent node using a weight matrix W that is shared across the whole network, followed by a non-linear function g(·). Specifically, if h_L and h_R are the d-dimensional vector representations of the left and right children nodes respectively, their parent node h_P will be a d-dimensional vector as well, calculated as:

h_P = g(W [h_L; h_R]),    (3)

where W ∈ R^{d×2d} and g is a non-linear function as mentioned above.

3.2 Gated Recursive Neural Network

The RNN needs a topological structure to model a sequence, such as a syntactic tree. In this paper, we use a directed acyclic graph (DAG), as shown in Figure 3, to model the combinations of the input characters, in which two consecutive nodes in the lower layer are combined into a single node in the upper layer via the operation of Eq. (3).

In fact, the DAG structure can model the combinations of characters by continuously mixing the information from the bottom layer to the top layer. Each neuron can be regarded as a complicated feature composition of the characters it governs, similar to the discrete feature based models. The difference between them is that the neural model learns the complicated combinations automatically, while the conventional one needs to design them manually.
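As a sketch of how this DAG stacks up over a window — a pyramid in which layer l has w − l + 1 nodes, each combining two consecutive nodes of the layer below — the following code builds the structure with the plain, ungated composition of Eq. (3); the gated version of Section 3.3 would replace `compose`. All names and initializations here are illustrative, not the authors' implementation.

```python
import numpy as np

d, w = 50, 5
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(d, 2 * d))    # shared composition matrix of Eq. (3)

def compose(h_left, h_right):
    """Plain (ungated) combination of two children, Eq. (3)."""
    return np.tanh(W @ np.concatenate([h_left, h_right]))

def grnn_pyramid(embeddings):
    """Recursively combine the w window embeddings; layer l has w - l + 1 nodes.
    Returns the outputs of every node, which are later concatenated into x_i."""
    layers = [list(embeddings)]                      # layer 1: the character embeddings themselves
    for l in range(2, w + 1):
        prev = layers[-1]
        layers.append([compose(prev[j], prev[j + 1]) for j in range(len(prev) - 1)])
    return [node for layer in layers for node in layer]

nodes = grnn_pyramid([rng.normal(size=d) for _ in range(w)])
print(len(nodes))     # 5 + 4 + 3 + 2 + 1 = 15 nodes, i.e. q = w(w+1)/2 feature compositions
```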

When two children nodes are combined into their parent node, the combination information of the two children is also merged and preserved by the parent node. Although the mechanism above seems to work well, in practice it cannot sufficiently model the complicated combination features due to its simplicity.

Inspired by the success of the gated recurrent neural network (Cho et al., 2014b; Chung et al., 2014), we propose a gated recursive neural network (GRNN) by introducing two kinds of gates, namely the "reset gate" and the "update gate". Specifically, there are two reset gates, r_L and r_R, partially reading the information from the left child and the right child respectively. The update gates z_N, z_L and z_R decide what to preserve when combining the children's information. Intuitively, these gates decide how to update and exploit the combination information.

In the case of word segmentation, for each character c_i of a given sentence c_{1:n}, we first represent each context character c_j by its corresponding embedding c_j, where i − w_1 ≤ j ≤ i + w_2, and the definitions of w_1 and w_2 are the same as mentioned above.

Then, the embeddings are sent to the first layer of GRNN as inputs, whose outputs are recursively applied to the upper layers until the network outputs a single fixed-length vector.

The outputs of the different neurons can be regarded as different feature compositions. After concatenating the outputs of all neurons in the network, we get a new big vector x_i. Next, we obtain the tag score vector y_i for character c_i by a linear transformation of x_i:

y_i = W_s × x_i + b_s,    (4)

where b_s ∈ R^{|T|}, W_s ∈ R^{|T|×Q}. Q = q × d is the dimensionality of the concatenated vector x_i, where q is the number of nodes in the network.

3.3 Gated Recursive Unit

GRNN consists of minimal structures, the gated recursive units, as shown in Figure 4.

Figure 4: Our proposed gated recursive unit.

By assuming that the window size is w, we have recursion layers l ∈ [1, w]. At each recursion layer l, the activation of the j-th hidden node h_j^{(l)} ∈ R^d is computed as:

h_j^{(l)} = z_N ⊙ ĥ_j^{(l)} + z_L ⊙ h_{j−1}^{(l−1)} + z_R ⊙ h_j^{(l−1)},   if l > 1;
h_j^{(l)} = c_j,   if l = 1,    (5)

where z_N, z_L and z_R ∈ R^d are the update gates for the new activation ĥ_j^{(l)}, the left child node h_{j−1}^{(l−1)} and the right child node h_j^{(l−1)} respectively, and ⊙ indicates element-wise multiplication.

The update gates can be formalized as:

z = [z_N; z_L; z_R] = [1/Z; 1/Z; 1/Z] ⊙ exp(U [ĥ_j^{(l)}; h_{j−1}^{(l−1)}; h_j^{(l−1)}]),    (6)

where U ∈ R^{3d×3d} is the coefficient matrix of the update gates, and Z ∈ R^d is the vector of normalization coefficients:

Z_k = Σ_{i=1}^{3} [exp(U [ĥ_j^{(l)}; h_{j−1}^{(l−1)}; h_j^{(l−1)}])]_{d×(i−1)+k},   1 ≤ k ≤ d.    (7)

Intuitively, the three update gates are constrained by:

[z_N]_k + [z_L]_k + [z_R]_k = 1,   1 ≤ k ≤ d;
[z_N]_k ≥ 0,  [z_L]_k ≥ 0,  [z_R]_k ≥ 0,   1 ≤ k ≤ d.    (8)

The new activation ĥ_j^{(l)} is computed as:

ĥ_j^{(l)} = tanh(W_ĥ [r_L ⊙ h_{j−1}^{(l−1)}; r_R ⊙ h_j^{(l−1)}]),    (9)

where W_ĥ ∈ R^{d×2d}, r_L ∈ R^d, r_R ∈ R^d. Here, r_L and r_R are the reset gates for the left child node h_{j−1}^{(l−1)} and the right child node h_j^{(l−1)} respectively, which can be formalized as:

[r_L; r_R] = σ(G [h_{j−1}^{(l−1)}; h_j^{(l−1)}]),    (10)–(11)

where G ∈ R^{2d×2d} is the coefficient matrix of the two reset gates and σ indicates the sigmoid function.
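The following NumPy sketch implements one gated recursive unit along the lines of Eqs. (5)–(10): the reset gates gate the children before the candidate activation, and the three update gates form a dimension-wise softmax over {new activation, left child, right child}. The parameter names mirror the paper's symbols, but the random initialization and sizes are only illustrative.

```python
import numpy as np

d = 50
rng = np.random.default_rng(2)
W_hat = rng.normal(scale=0.1, size=(d, 2 * d))      # W_h^ in Eq. (9)
G     = rng.normal(scale=0.1, size=(2 * d, 2 * d))  # reset-gate matrix in Eq. (10)
U     = rng.normal(scale=0.1, size=(3 * d, 3 * d))  # update-gate matrix in Eq. (6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_recursive_unit(h_left, h_right):
    """Combine two children nodes into a parent node, Eqs. (5)-(10)."""
    # Reset gates r_L, r_R (Eq. 10) select what to read from each child.
    r = sigmoid(G @ np.concatenate([h_left, h_right]))
    r_L, r_R = r[:d], r[d:]
    # Candidate activation h^ (Eq. 9) built from the gated children.
    h_new = np.tanh(W_hat @ np.concatenate([r_L * h_left, r_R * h_right]))
    # Update gates z_N, z_L, z_R (Eqs. 6-8): a dimension-wise softmax over the
    # three candidates, so they are non-negative and sum to one (Eq. 8).
    scores = np.exp(U @ np.concatenate([h_new, h_left, h_right])).reshape(3, d)
    z_N, z_L, z_R = scores / scores.sum(axis=0)
    # Parent activation (Eq. 5): convex combination of the three candidates.
    return z_N * h_new + z_L * h_left + z_R * h_right

parent = gated_recursive_unit(rng.normal(size=d), rng.normal(size=d))
print(parent.shape)   # (50,)
```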

Intuitively, the reset gates control how to select the output information of the left and right children, which results in the current new activation ĥ.

Through the update gates, the activation of a parent neuron can be regarded as a choice among the current new activation ĥ, the left child and the right child. This choice allows the overall structure to change adaptively with respect to the inputs. This gating mechanism is effective for modeling the combinations of the characters.

3.4 Inference

In the Chinese word segmentation task, the Viterbi algorithm is usually employed to infer the tag sequence t_{1:n} for a given input sentence c_{1:n}.

In order to model the tag dependencies, the previous neural network models (Collobert et al., 2011; Zheng et al., 2013; Pei et al., 2014) introduce a transition matrix A, where each entry A_{ij} is the score of the transition from tag i ∈ T to tag j ∈ T. Thus, the sentence-level score can be formulated as follows:

s(c_{1:n}, t_{1:n}, θ) = Σ_{i=1}^{n} ( A_{t_{i−1} t_i} + f_θ(t_i | c_{i−w_1:i+w_2}) ),    (12)

where f_θ(t_i | c_{i−w_1:i+w_2}) is the score for choosing tag t_i for the i-th character, given by our proposed GRNN (Eq. (4)). The parameter set of our model is θ = (M, W_s, b_s, W_ĥ, U, G, A).

4 Training

4.1 Layer-wise Training

A deep neural network with multiple hidden layers is very difficult to train because of the problem of gradient diffusion and the risk of overfitting. Following Hinton and Salakhutdinov (2006), we employ a layer-wise training strategy to avoid the problems of overfitting and gradient vanishing. The main idea of layer-wise training is to train the network by adding the layers one by one. Specifically, we first train the neural network with the first hidden layer only. Then, we train the network with two hidden layers after the training with the first layer is done, and so on, until we reach the top hidden layer. When the network with layers 1 to l has converged, we preserve the current parameters as the initial values for training the network with layers 1 to l + 1.

4.2 Max-Margin Criterion

We use the max-margin criterion (Taskar et al., 2005) to train our model. Intuitively, the max-margin criterion provides an alternative to probabilistic, likelihood-based estimation methods by concentrating directly on the robustness of the decision boundary of a model. We use Y(x_i) to denote the set of all possible tag sequences for a given sentence x_i, and the correct tag sequence for x_i is y_i. The parameter set of our model is θ. We first define a structured margin loss ∆(y_i, ŷ) for predicting a tag sequence ŷ for a given correct tag sequence y_i:

∆(y_i, ŷ) = Σ_{j=1}^{n} η · 1{y_{i,j} ≠ ŷ_j},    (13)

where n is the length of sentence x_i and η is a discount parameter. The loss is proportional to the number of characters with an incorrect tag in the predicted tag sequence. For a given training instance (x_i, y_i), we search for the tag sequence with the highest score:

y* = argmax_{ŷ ∈ Y(x_i)} s(x_i, ŷ, θ),    (14)

where the tag sequence is found and scored by the proposed model via the function s(·) in Eq. (12). The objective of max-margin training is that the tag sequence with the highest score is the correct one, y* = y_i, and that its score is larger than the scores of other possible tag sequences ŷ ∈ Y(x_i) by a margin:

s(x_i, y_i, θ) ≥ s(x_i, ŷ, θ) + ∆(y_i, ŷ).    (15)

This leads to the regularized objective function for m training examples:
J(θ) = l (θ) + θ 2, (16) m i 2 ∥ ∥2 Following (Hinton and Salakhutdinov, 2006), i=1 we employ the layer-wise training strategy to avoid ∑ li(θ) = max (s(xi, y,ˆ θ)+∆(yi, yˆ)) s(xi, yi, θ). problems of overfitting and gradient vanishing. yˆ Y (xi) − The main idea of layer-wise training is to train the ∈ (17)

1748 By minimizing this object, the score of the correct Window size k = 5 tag sequence yi is increased and score of the high- Character embedding size d = 50 est scoring incorrect tag sequence yˆ is decreased. Initial learning rate α = 0.3 The objective function is not differentiable due to Margin loss discount η = 0.2 4 the hinge loss. We use a generalization of gradient Regularization λ = 10− descent called subgradient method (Ratliff et al., Dropout rate on input layer p = 20% 2007) which computes a gradient-like direction. Following (Socher et al., 2013a), we minimize Table 1: Hyper-parameter settings. the objective by the diagonal variant of AdaGrad (Duchi et al., 2011) with minibatchs. The parame- ter update for the i-th parameter θt,i at time step t is as follows: 96

α 1 layer θt,i = θt 1,i gt,i, (18) 94 − − t 2 2 layers τ=1 gτ,i 3 layers √ 4 layers ∑ 92 5 layers θi where α is the initial learning rate and gτ R| | F-value(%) layer-wise ∈ is the subgradient at time step τ for parameter θ . i 90 5 Experiments 88 We evaluate our model on two different kinds of 0 10 20 30 40 texts: newswire texts and micro-blog texts. For epoches evaluation, we use the standard Bakeoff scoring program to calculate precision, recall, F1-score. Figure 5: Performance of different models with or without layer-wise training strategy on PKU devel- 5.1 Word Segmentation on Newswire Texts opment set. 5.1.1 Datasets We use three popular datasets, PKU, MSRA and CTB6, to evaluate our model on newswire texts. find that it is a good balance between model per- The PKU and MSRA data are provided by the formance and efficiency to set character embed- second International Chinese Word Segmentation ding size d = 50. In fact, the larger embedding Bakeoff (Emerson, 2005), and CTB6 is from size leads to higher cost of computational resource, Chinese TreeBank 6.0 (LDC2007T36) (Xue et while lower dimensionality of the character em- al., 2005), which is a segmented, part-of-speech bedding seems to underfit according to the experi- tagged, and fully bracketed corpus in the con- ment results. stituency formalism. These datasets are commonly used by previous state-of-the-art models and neu- Deep neural networks contain multiple non- ral network models. In addition, we use the first linear hidden layers are always hard to train for it 90% sentences of the training data as training set is easy to overfit. Several methods have been used and the rest 10% sentences as development set for in neural models to avoid overfitting, such as early PKU and MSRA datasets, and we divide the train- stop and weight regularization. Dropout (Srivas- ing, development and test sets according to (Yang tava et al., 2014) is also one of the popular strate- and Xue, 2012) for the CTB6 dataset. gies to avoid overfitting when training the deep All datasets are preprocessed by replacing the neural networks. Hence, we utilize the dropout Chinese idioms and the continuous English char- strategy in this work. Specifically, dropout is to acters and digits with a unique flag. temporarily remove the neuron away with a fixed probability p independently, along with the incom- 5.1.2 Hyper-parameters ing and outgoing connections of it. As a result, We set the hyper-parameters of the model as list we find dropout on the input layer with probability in Table 1 via experiments on development set. p = 20% is a good tradeoff between model effi- In addition, we set the batch size to 20. And we ciency and performance.

1749 without layer-wise with layer-wise models P R F P R F GRNN (1 layer) 90.7 89.6 90.2 - - - GRNN (2 layers) 96.0 95.6 95.8 96.0 95.6 95.8 GRNN (3 layers) 95.9 95.4 95.7 96.0 95.7 95.9 GRNN (4 layers) 95.6 95.2 95.4 96.1 95.7 95.9 GRNN (5 layers) 95.3 94.7 95.0 96.1 95.7 95.9

Table 2: Performance of different models with or without layer-wise training strategy on PKU test set.

5.1.3 Layer-wise Training character embeddings instead of that with random We first investigate the effects of the layer-wise initialization. Thus, we pre-train the embeddings training strategy. Since we set the size of context on a huge unlabeled data, the Chinese Wikipedia window to five, there are five recursive layers in corpus, with word2vec toolkit (Mikolov et al., our architecture. And we train the networks with 2013). By using these obtained character embed- the different numbers of recursion layers. Due to dings, our model receives better performance and the limit of space, we just give the results on PKU still outperforms the previous neural models with dataset. pre-trained character embeddings. The detailed re- Figure 5 gives the convergence speeds of the sults are shown in Table 4 (1st to 3rd rows). five models with different numbers of layers and Inspired by (Pei et al., 2014), we utilize the bi- the model with layer-wise training strategy on de- gram feature embeddings in our model as well. velopment set of PKU dataset. The model with The concept of feature embedding is quite similar one layer just use the neurons of the lowest layer to that of character embedding mentioned above. in final linear score function. Since there are no Specifically, each context feature is represented as non-linear layer, its seems to underfit and perform a single vector called feature embedding. In this poorly. The model with two layers just use the paper, we only use the simply bigram feature em- neurons in the lowest two layers, and so on. The beddings initialized by the average of two embed- model with five layers use all the neurons in the dings of consecutive characters element-wisely. network. As we can see, the layer-wise training strategy lead to the fastest convergence and the Although the model of Pei et al. (2014) greatly best performance. benefits from the bigram feature embeddings, our Table 2 shows the performances on PKU test model just obtains a small improvement with them. set. The performance of the model with layer-wise This difference indicates that our model has well training strategy is always better than that with- modeled the combinations of the characters and do out layer-wise training strategy. With the increase not need much help of the feature engineering. The of the number of layers, the performance also in- detailed results are shown in Table 4 (4-th and 6-th creases and reaches the stable high performance rows). until getting to the top layer. Table 5 shows the comparisons of our model with the state-of-the-art systems on F-value. The 5.1.4 Results model proposed by Zhang and Clark (2007) is We first compare our model with the previous neu- a word-based segmentation method, which ex- ral approaches on PKU, MSRA and CTB6 datasets ploit features of complete words, while remains as showing in Table 3. The character embed- of the list are all character-based word segmenters, dings of the models are random initialized. The whose features are mostly extracted from the con- performance of word segmentation is significantly text characters. Moreover, some systems (such as boosted by exploiting the gated recursive archi- Sun and Xu (2011) and Zhang et al. (2013)) also tecture, which can better model the combinations exploit kinds of extra information such as the un- of the context characters than the previous neural labeled data or other knowledge. Although our models. 
model only uses simple bigram features, it outper- Previous works have proven it will greatly im- forms the previous state-of-the-art methods which prove the performance to exploit the pre-trained use more complex features.

models                 PKU                 MSRA                CTB6
                       P      R      F     P      R      F     P       R       F
(Zheng et al., 2013)   92.8   92.0   92.4  92.9   93.6   93.3  94.0*   93.1*   93.6*
(Pei et al., 2014)     93.7   93.4   93.5  94.6   94.2   94.4  94.4*   93.4*   93.9*
GRNN                   96.0   95.7   95.9  96.3   96.1   96.2  95.4    95.2    95.3

Table 3: Performances on PKU, MSRA and CTB6 test sets with randomly initialized character embeddings.

models                    PKU                 MSRA                CTB6
                          P      R      F     P      R      F     P       R       F
+Pre-train
(Zheng et al., 2013)      93.5   92.2   92.8  94.2   93.7   93.9  93.9*   93.4*   93.7*
(Pei et al., 2014)        94.4   93.6   94.0  95.2   94.6   94.9  94.2*   93.7*   94.0*
GRNN                      96.3   95.9   96.1  96.2   96.3   96.2  95.8    95.4    95.6
+bigram
GRNN                      96.6   96.2   96.4  97.5   97.3   97.4  95.9    95.7    95.8
+Pre-train+bigram
(Pei et al., 2014)        -      95.2   -     -      97.2   -     -       -       -
GRNN                      96.5   96.3   96.4  97.4   97.8   97.6  95.8    95.7    95.8

Table 4: Performances on PKU, MSRA and CTB6 test sets with pre-trained and bigram character embeddings.

models                    PKU    MSRA   CTB6
(Tseng et al., 2005)      95.0   96.4   -
(Zhang and Clark, 2007)   95.1   97.2   -
(Sun and Xu, 2011)        -      -      95.7
(Zhang et al., 2013)      96.1   97.4   -
This work                 96.4   97.6   95.8

Table 5: Comparison of GRNN with the state-of-the-art methods on PKU, MSRA and CTB6 test sets.

5.2 Word Segmentation on Micro-blog Texts

5.2.1 Dataset

We use the NLPCC 2015 dataset¹ (Qiu et al., 2015) to evaluate our model on micro-blog texts. The NLPCC 2015 data are provided by the shared task of the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2015): Chinese Word Segmentation and POS Tagging for Micro-blog Text. Unlike the popular newswire datasets, the NLPCC 2015 dataset is collected from Sina Weibo², and consists of relatively informal micro-blog texts on various topics, such as finance, sports, entertainment, and so on. The statistics of the dataset are shown in Table 6.

To train our model, we also use the first 90% of the sentences in the training data as the training set and the remaining 10% as the development set.

Here, we use the default setting of the CRF++ toolkit with the feature templates shown in Table 7. The same feature templates are also used for FNLP.

¹ http://nlp.fudan.edu.cn/nlpcc2015/
² http://www.weibo.com/
³ https://github.com/xpqiu/fnlp/
⁴ http://taku910.github.io/crfpp/
* The result is from our own implementation of the corresponding method.

Dataset    Sents    Words     Chars     Word Types   Char Types   OOV Rate
Training   10,000   215,027   347,984   28,208       3,971        -
Test       5,000    106,327   171,652   18,696       3,538        7.25%
Total      15,000   322,410   520,555   35,277       4,243        -

Table 6: Statistical information of NLPCC 2015 dataset.

unigram feature    c−2, c−1, c0, c+1, c+2
bigram feature     c−1 ◦ c0, c0 ◦ c+1
trigram feature    c−2 ◦ c−1 ◦ c0, c−1 ◦ c0 ◦ c+1, c0 ◦ c+1 ◦ c+2

Table 7: Templates of CRF++ and FNLP.

5.2.2 Results

Since the NLPCC 2015 dataset is a newly released dataset, we compare our model with two popular open source toolkits for the sequence labeling task: FNLP³ (Qiu et al., 2013) and CRF++⁴. Our model uses pre-trained and bigram character embeddings. Table 8 shows the comparison of our model with the other systems on the NLPCC 2015 dataset.

models      P      R      F
CRF++       93.3   93.2   93.3
FNLP        94.1   93.9   94.0
This work   94.7   94.8   94.8

Table 8: Performances on NLPCC 2015 dataset.

6 Related Work

Chinese word segmentation has been studied with considerable effort in the NLP community. The most popular word segmentation method is based on sequence labeling (Xue, 2003). Recently, researchers have tended to explore neural network based approaches (Collobert et al., 2011) to reduce the effort of feature engineering (Zheng et al., 2013; Qi et al., 2014). However, the features of all these methods are the concatenation of the embeddings of the context characters.

Pei et al. (2014) also used the neural tensor model (Socher et al., 2013b) to capture the complicated interactions between tags and context characters. But the interactions depend on the number of tensor slices, which cannot be too large due to the model complexity. The experiments also show that the model of Pei et al. (2014) greatly benefits from the further bigram feature embeddings, which indicates that their model cannot even handle the interactions of consecutive characters. Unlike them, our model obtains only a small improvement with the bigram feature embeddings, which indicates that our approach has already modeled the complicated combinations of the context characters well, and does not need much help from further feature engineering.

More recently, Cho et al. (2014a) also proposed a gated recursive convolutional neural network for the machine translation task, to deal with sentences of varying lengths. However, their approach only models the update gate and, lacking a reset gate, cannot tell whether the information comes from the current state or from the sub-nodes during the update stage. Instead, our approach models two kinds of gates, the reset gate and the update gate, with which we can better model the combinations of context characters via the selection function of the reset gate and the collection function of the update gate.

7 Conclusion

In this paper, we propose a gated recursive neural network (GRNN) to explicitly model the combinations of the characters for the Chinese word segmentation task. Each neuron in GRNN can be regarded as a different combination of the input characters. Thus, the whole GRNN has the ability to simulate the design of the sophisticated features in traditional methods. Experiments show that our proposed model outperforms the state-of-the-art methods on three popular benchmark datasets.

Despite Chinese word segmentation being a specific case, our model can be easily generalized and applied to other sequence labeling tasks. In future work, we would like to investigate our proposed GRNN on other sequence labeling tasks.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. This work was partially funded by the National Natural Science Foundation of China (61472088, 61473092), the National High Technology Research and Development Program of China (2015AA015408), Shanghai Science and Technology Development Funds (14ZR1403200), and Shanghai Leading Academic Discipline Project (B114).

References

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of the Workshop on Syntax, Semantics and Structure in Statistical Translation.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

T. Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 123–133. Jeju Island, Korea.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Wenzhe Pei, Tao Ge, and Chang Baobao. 2014. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of ACL.

Jordan B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1):77–105.

Yanjun Qi, Sujatha G. Das, Ronan Collobert, and Jason Weston. 2014. Deep learning for character-based information extraction. In Advances in Information Retrieval, pages 668–674. Springer.

Xipeng Qiu, Qi Zhang, and Xuanjing Huang. 2013. FudanNLP: A toolkit for Chinese natural language processing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Xipeng Qiu, Peng Qian, Liusong Yin, and Xuanjing Huang. 2015. Overview of the NLPCC 2015 shared task: Chinese word segmentation and POS tagging for micro-blog texts. arXiv preprint arXiv:1505.07599.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. 2007. (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats).

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013b. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Weiwei Sun and Jia Xu. 2011. Enhancing Chinese word segmentation using unlabeled data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 970–979. Association for Computational Linguistics.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Proceedings of the International Conference on Machine Learning.

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for SIGHAN bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, volume 171.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

N. Xue. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.

Yaqin Yang and Nianwen Xue. 2012. Chinese comma disambiguation for discourse analysis. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 786–794. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In ACL.

Longkai Zhang, Houfeng Wang, Xu Sun, and Mairgup Mansur. 2013. Exploring representations from unlabeled data with co-training for Chinese word segmentation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. 2013. Deep learning for Chinese word segmentation and POS tagging. In EMNLP, pages 647–657.
