Gated Recursive Neural Network for Chinese Word Segmentation

Xinchi Chen, Xipeng Qiu∗, Chenxi Zhu, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{xinchichen13,xpqiu,czhu13,xjhuang}@fudan.edu.cn

∗ Corresponding author.

Abstract

Recently, neural network models for natural language processing tasks have received increasing attention for their ability to alleviate the burden of manual feature engineering. However, the previous neural models cannot extract the complicated feature compositions that the traditional methods with discrete features can. In this paper, we propose a gated recursive neural network (GRNN) for Chinese word segmentation, which contains reset and update gates to incorporate the complicated combinations of the context characters. Since GRNN is relatively deep, we also use a supervised layer-wise training method to avoid the problem of gradient diffusion. Experiments on the benchmark datasets show that our model outperforms the previous neural network models as well as the state-of-the-art methods.

Figure 1: Illustration of our model for Chinese word segmentation on the sentence "下雨天地面积水" (Rainy Day / Ground / Accumulated water). The solid nodes indicate the active neurons, while the hollow ones indicate the suppressed neurons. The links denote the information flow: solid edges denote the acceptance of a combination, while dashed edges denote its rejection. As shown on the right of the figure, we obtain a score vector over the tags {B, M, E, S} for the target character "地" by incorporating all the combination information.

1 Introduction

Unlike English and other western languages, Chinese does not delimit words by white-space. Therefore, word segmentation is a preliminary and important pre-processing step for Chinese language processing. Most previous systems address this problem by treating it as a sequence labeling task and have achieved great success. Due to the nature of supervised learning, the performance of these models is greatly affected by the design of features. These features are explicitly represented by different combinations of the context characters, which are based on linguistic intuition and statistical information. However, the number of features can be so large that the resulting models are too large to use in practice and prone to overfitting on the training corpus.

Recently, neural network models have received increasing attention for their ability to minimize the effort of feature engineering. Collobert et al. (2011) developed a general neural network architecture for sequence labeling tasks. Following this work, many methods (Zheng et al., 2013; Pei et al., 2014; Qi et al., 2014) applied neural networks to Chinese word segmentation and achieved performance approaching the state-of-the-art methods.

However, these neural models simply concatenate the embeddings of the context characters and feed them into the neural network. Since the concatenation operation is relatively simple, it is difficult to model the complicated features that the traditional discrete feature based models capture. Although the complicated interactions of the inputs could in principle be modeled by a deep neural network, previous work shows that such a deep model does not outperform the one with a single non-linear layer.

Therefore, the previous neural models only capture interactions through a simple transition matrix and a single non-linear transformation. The dense features extracted via these simple interactions are not nearly as good as the substantial discrete features in the traditional methods.

In this paper, we propose a gated recursive neural network (GRNN) to model the complicated combinations of characters, and apply it to the Chinese word segmentation task. Inspired by the success of the gated recurrent neural network (Chung et al., 2014), we introduce two kinds of gates to control the combinations in the recursive structure. We also use a layer-wise training method to avoid the problem of gradient diffusion, and the dropout strategy to avoid overfitting.

Figure 1 gives an illustration of how our approach models the complicated combinations of the context characters. Given the sentence "雨 (Rainy) 天 (Day) 地面 (Ground) 积水 (Accumulated water)", the target character is "地". This sentence is difficult because every two consecutive characters could be combined into a word. To predict the label of the target character "地" in the given context, GRNN detects the combinations recursively from the bottom layer to the top. Then, we obtain a score vector over the tags by incorporating all the combination information in the network.

The contributions of this paper can be summarized as follows:

• We propose a novel GRNN architecture to model the complicated combinations of the context characters. GRNN can select and preserve the useful combinations via reset and update gates. These combinations play a role similar to that of the feature engineering in the traditional methods with discrete features.

• We evaluate the performance of Chinese word segmentation on the PKU, MSRA and CTB6 benchmark datasets, which are commonly used for the evaluation of Chinese word segmentation. Experimental results show that our model outperforms other neural network models and achieves state-of-the-art performance.

2 Neural Model for Chinese Word Segmentation

The Chinese word segmentation task is usually regarded as a character-based sequence labeling problem. Each character is labeled as one of {B, M, E, S} to indicate the segmentation: {B, M, E} represent the Begin, Middle and End of a multi-character segment respectively, and S represents a single-character segment.

The general neural network architecture for the Chinese word segmentation task is usually characterized by three specialized layers: (1) a character embedding layer; (2) a series of classical neural network layers; and (3) a tag inference layer. An illustration is shown in Figure 2.

Figure 2: General architecture of the neural model for Chinese word segmentation (input window of characters c_{i−2}, …, c_{i+2}, lookup table, concatenation, linear layer W_1×□+b_1, sigmoid layer g(□), linear layer W_2×□+b_2, and tag inference with transition scores A_{ij} over the per-position scores f(t|1), …, f(t|n)).

The most common tagging approach is based on a local window. The window approach assumes that the tag of a character largely depends on its neighboring characters.
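To make the tagging scheme and the local window concrete, here is a minimal sketch (hypothetical helper code, not from the paper) that converts a segmented sentence into character-level {B, M, E, S} tags and extracts a fixed-size context window around each character, padding at the sentence boundaries with the special symbols described in the next paragraphs. The segmentation used in the example follows the one given in the introduction.

```python
# Minimal sketch (not the authors' code): BMES tagging and context-window
# extraction for character-based Chinese word segmentation.

def words_to_bmes(words):
    """Convert a segmented sentence (list of words) to per-character BMES tags."""
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return chars, tags

def context_window(chars, i, w1=2, w2=2):
    """Return the (w1 + 1 + w2)-character window around position i,
    padded with the special symbols "start"/"end" at the boundaries."""
    padded = ["start"] * w1 + list(chars) + ["end"] * w2
    return padded[i:i + w1 + 1 + w2]

chars, tags = words_to_bmes(["雨", "天", "地面", "积水"])
print(list(zip(chars, tags)))   # [('雨','S'), ('天','S'), ('地','B'), ('面','E'), ('积','B'), ('水','E')]
print(context_window(chars, 2)) # five-character window centered on '地'
```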

Firstly, we have a character set C of size |C|. Each character c ∈ C is mapped into a d-dimensional embedding space as c ∈ R^d by a lookup table M ∈ R^{d×|C|}.

For each character c_i in a given sentence c_{1:n}, the context characters c_{i−w_1:i+w_2} are mapped to their corresponding character embeddings, where w_1 and w_2 are the left and right context lengths respectively. Specifically, the unknown characters and the characters exceeding the sentence boundaries are mapped to the special symbols "unknown", "start" and "end" respectively. In addition, w_1 and w_2 satisfy the constraint w_1 + w_2 + 1 = w, where w is the window size of the model. In the illustration in Figure 2, w_1, w_2 and w are set to 2, 2 and 5 respectively.

The embeddings of all the context characters are then concatenated into a single vector a_i ∈ R^{H_1} as the input of the neural network, where H_1 = w × d is the size of Layer 1. Then a_i is fed into a conventional neural network layer, which performs a linear transformation followed by an element-wise activation function g, such as tanh:

h_i = g(W_1 a_i + b_1),    (1)

where W_1 ∈ R^{H_2×H_1}, b_1 ∈ R^{H_2}, h_i ∈ R^{H_2}, and H_2 is the number of hidden units in Layer 2. Here, w, H_1 and H_2 are hyper-parameters chosen on the development set.

Then, a similar linear transformation is performed, this time without a following non-linear function:

f(t | c_{i−w_1:i+w_2}) = W_2 h_i + b_2,    (2)

where W_2 ∈ R^{|T|×H_2}, b_2 ∈ R^{|T|} and T is the set of the 4 possible tags. Each dimension of the vector f(t | c_{i−w_1:i+w_2}) ∈ R^{|T|} is the score of the corresponding tag.

To model the tag dependency, a transition score A_{ij} is introduced to measure the probability of jumping from tag i ∈ T to tag j ∈ T (Collobert et al., 2011).
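To make this baseline concrete, the following NumPy sketch implements the window-based scorer of Eqs. (1)–(2). The sizes and the random initialization are illustrative assumptions, not the authors' configuration or code.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's exact configuration).
d, w, H2, T = 50, 5, 100, 4          # embedding size, window, hidden units, number of tags
vocab = 5000
H1 = w * d

rng = np.random.default_rng(0)
M  = rng.normal(scale=0.1, size=(d, vocab))   # lookup table, one column per character
W1 = rng.normal(scale=0.1, size=(H2, H1)); b1 = np.zeros(H2)
W2 = rng.normal(scale=0.1, size=(T, H2));  b2 = np.zeros(T)

def window_scores(char_ids):
    """Tag scores f(t | c_{i-w1:i+w2}) for a window of w character ids (Eqs. 1-2)."""
    a = M[:, char_ids].T.reshape(-1)          # concatenate the w embeddings -> a_i in R^{H1}
    h = np.tanh(W1 @ a + b1)                  # Eq. (1): linear layer + element-wise tanh
    return W2 @ h + b2                        # Eq. (2): linear layer, one score per tag

print(window_scores([17, 256, 3, 88, 901]))   # 4 scores, one for each of {B, M, E, S}
```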

Figure 3: Architecture of the gated recursive neural network for Chinese word segmentation (input characters c_{i−2}, …, c_{i+2}, recursive layers, concatenation of all node outputs, and a linear scoring layer y_i = W_s × x_i + b_s over the tags).

3 Gated Recursive Neural Network for Chinese Word Segmentation

To model the complicated feature combinations, we propose a novel gated recursive neural network (GRNN) architecture for the Chinese word segmentation task (see Figure 3).

3.1 Recursive Neural Network

A recursive neural network (RNN) is a kind of deep neural network created by applying the same set of weights recursively over a given structure (such as a parsing tree) in topological order (Pollack, 1990; Socher et al., 2013a).

In the simplest case, children nodes are combined into their parent node using a weight matrix W that is shared across the whole network, followed by a non-linear function g(·). Specifically, if h_L and h_R are the d-dimensional vector representations of the left and right children nodes respectively, their parent node h_P will be a d-dimensional vector as well, calculated as:

h_P = g(W [h_L; h_R]),    (3)

where W ∈ R^{d×2d} and g is a non-linear function as mentioned above.

3.2 Gated Recursive Neural Network

The RNN needs a topological structure to model a sequence, such as a syntactic tree. In this paper, we use a directed acyclic graph (DAG), as shown in Figure 3, to model the combinations of the input characters, in which two consecutive nodes in the lower layer are combined into a single node in the upper layer via the operation of Eq. (3).

In fact, the DAG structure can model the combinations of characters by continuously mixing the information from the bottom layer to the top layer. Each neuron can be regarded as a complicated feature composition of the characters it governs, similar to the discrete feature based models. The difference between them is that the neural model learns the complicated combinations automatically, while the conventional one needs to design them manually.
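As a sketch of how this DAG stacks up over a window — a pyramid in which layer l has w − l + 1 nodes, each combining two consecutive nodes of the layer below — the following code builds the structure with the plain, ungated composition of Eq. (3); the gated version of Section 3.3 would replace `compose`. All names and initializations here are illustrative, not the authors' implementation.

```python
import numpy as np

d, w = 50, 5
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(d, 2 * d))    # shared composition matrix of Eq. (3)

def compose(h_left, h_right):
    """Plain (ungated) combination of two children, Eq. (3)."""
    return np.tanh(W @ np.concatenate([h_left, h_right]))

def grnn_pyramid(embeddings):
    """Recursively combine the w window embeddings; layer l has w - l + 1 nodes.
    Returns the outputs of every node, which are later concatenated into x_i."""
    layers = [list(embeddings)]                      # layer 1: the character embeddings themselves
    for l in range(2, w + 1):
        prev = layers[-1]
        layers.append([compose(prev[j], prev[j + 1]) for j in range(len(prev) - 1)])
    return [node for layer in layers for node in layer]

nodes = grnn_pyramid([rng.normal(size=d) for _ in range(w)])
print(len(nodes))     # 5 + 4 + 3 + 2 + 1 = 15 nodes, i.e. q = w(w+1)/2 feature compositions
```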

When two children nodes are combined into their parent node, the combination information of the two children is also merged and preserved by the parent node. Although the mechanism above seems to work well, in practice it cannot sufficiently model the complicated combination features due to its simplicity.

Inspired by the success of the gated recurrent neural network (Cho et al., 2014b; Chung et al., 2014), we propose a gated recursive neural network (GRNN) by introducing two kinds of gates, namely the "reset gate" and the "update gate". Specifically, there are two reset gates, r_L and r_R, partially reading the information from the left child and the right child respectively. The update gates z_N, z_L and z_R decide what to preserve when combining the children's information. Intuitively, these gates decide how to update and exploit the combination information.

In the case of word segmentation, for each character c_i of a given sentence c_{1:n}, we first represent each context character c_j by its corresponding embedding c_j, where i − w_1 ≤ j ≤ i + w_2, and the definitions of w_1 and w_2 are the same as mentioned above.

Then, the embeddings are sent to the first layer of GRNN as inputs, whose outputs are recursively applied to the upper layers until the network outputs a single fixed-length vector.

The outputs of the different neurons can be regarded as different feature compositions. After concatenating the outputs of all neurons in the network, we get a new big vector x_i. Next, we obtain the tag score vector y_i for character c_i by a linear transformation of x_i:

y_i = W_s × x_i + b_s,    (4)

where b_s ∈ R^{|T|}, W_s ∈ R^{|T|×Q}. Q = q × d is the dimensionality of the concatenated vector x_i, where q is the number of nodes in the network.

3.3 Gated Recursive Unit

GRNN consists of minimal structures, the gated recursive units, as shown in Figure 4.

Figure 4: Our proposed gated recursive unit.

By assuming that the window size is w, we have recursion layers l ∈ [1, w]. At each recursion layer l, the activation of the j-th hidden node h_j^{(l)} ∈ R^d is computed as:

h_j^{(l)} = z_N ⊙ ĥ_j^{(l)} + z_L ⊙ h_{j−1}^{(l−1)} + z_R ⊙ h_j^{(l−1)},   if l > 1;
h_j^{(l)} = c_j,   if l = 1,    (5)

where z_N, z_L and z_R ∈ R^d are the update gates for the new activation ĥ_j^{(l)}, the left child node h_{j−1}^{(l−1)} and the right child node h_j^{(l−1)} respectively, and ⊙ indicates element-wise multiplication.

The update gates can be formalized as:

z = [z_N; z_L; z_R] = [1/Z; 1/Z; 1/Z] ⊙ exp(U [ĥ_j^{(l)}; h_{j−1}^{(l−1)}; h_j^{(l−1)}]),    (6)

where U ∈ R^{3d×3d} is the coefficient matrix of the update gates, and Z ∈ R^d is the vector of normalization coefficients:

Z_k = Σ_{i=1}^{3} [exp(U [ĥ_j^{(l)}; h_{j−1}^{(l−1)}; h_j^{(l−1)}])]_{d×(i−1)+k},   1 ≤ k ≤ d.    (7)

Intuitively, the three update gates are constrained by:

[z_N]_k + [z_L]_k + [z_R]_k = 1,   1 ≤ k ≤ d;
[z_N]_k ≥ 0,  [z_L]_k ≥ 0,  [z_R]_k ≥ 0,   1 ≤ k ≤ d.    (8)

The new activation ĥ_j^{(l)} is computed as:

ĥ_j^{(l)} = tanh(W_ĥ [r_L ⊙ h_{j−1}^{(l−1)}; r_R ⊙ h_j^{(l−1)}]),    (9)

where W_ĥ ∈ R^{d×2d}, r_L ∈ R^d, r_R ∈ R^d. Here, r_L and r_R are the reset gates for the left child node h_{j−1}^{(l−1)} and the right child node h_j^{(l−1)} respectively, which can be formalized as:

[r_L; r_R] = σ(G [h_{j−1}^{(l−1)}; h_j^{(l−1)}]),    (10)–(11)

where G ∈ R^{2d×2d} is the coefficient matrix of the two reset gates and σ indicates the sigmoid function.
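The following NumPy sketch implements one gated recursive unit along the lines of Eqs. (5)–(10): the reset gates gate the children before the candidate activation, and the three update gates form a dimension-wise softmax over {new activation, left child, right child}. The parameter names mirror the paper's symbols, but the random initialization and sizes are only illustrative.

```python
import numpy as np

d = 50
rng = np.random.default_rng(2)
W_hat = rng.normal(scale=0.1, size=(d, 2 * d))      # W_h^ in Eq. (9)
G     = rng.normal(scale=0.1, size=(2 * d, 2 * d))  # reset-gate matrix in Eq. (10)
U     = rng.normal(scale=0.1, size=(3 * d, 3 * d))  # update-gate matrix in Eq. (6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_recursive_unit(h_left, h_right):
    """Combine two children nodes into a parent node, Eqs. (5)-(10)."""
    # Reset gates r_L, r_R (Eq. 10) select what to read from each child.
    r = sigmoid(G @ np.concatenate([h_left, h_right]))
    r_L, r_R = r[:d], r[d:]
    # Candidate activation h^ (Eq. 9) built from the gated children.
    h_new = np.tanh(W_hat @ np.concatenate([r_L * h_left, r_R * h_right]))
    # Update gates z_N, z_L, z_R (Eqs. 6-8): a dimension-wise softmax over the
    # three candidates, so they are non-negative and sum to one (Eq. 8).
    scores = np.exp(U @ np.concatenate([h_new, h_left, h_right])).reshape(3, d)
    z_N, z_L, z_R = scores / scores.sum(axis=0)
    # Parent activation (Eq. 5): convex combination of the three candidates.
    return z_N * h_new + z_L * h_left + z_R * h_right

parent = gated_recursive_unit(rng.normal(size=d), rng.normal(size=d))
print(parent.shape)   # (50,)
```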

Intuitively, the reset gates control how to select the output information of the left and right children, which results in the current new activation ĥ.

Through the update gates, the activation of a parent neuron can be regarded as a choice among the current new activation ĥ, the left child and the right child. This choice allows the overall structure to change adaptively with respect to the inputs. This gating mechanism is effective for modeling the combinations of the characters.

3.4 Inference

In the Chinese word segmentation task, the Viterbi algorithm is usually employed to infer the tag sequence t_{1:n} for a given input sentence c_{1:n}.

In order to model the tag dependencies, the previous neural network models (Collobert et al., 2011; Zheng et al., 2013; Pei et al., 2014) introduce a transition matrix A, where each entry A_{ij} is the score of the transition from tag i ∈ T to tag j ∈ T. Thus, the sentence-level score can be formulated as follows:

s(c_{1:n}, t_{1:n}, θ) = Σ_{i=1}^{n} ( A_{t_{i−1} t_i} + f_θ(t_i | c_{i−w_1:i+w_2}) ),    (12)

where f_θ(t_i | c_{i−w_1:i+w_2}) is the score for choosing tag t_i for the i-th character, given by our proposed GRNN (Eq. (4)). The parameter set of our model is θ = (M, W_s, b_s, W_ĥ, U, G, A).

4 Training

4.1 Layer-wise Training

A deep neural network with multiple hidden layers is very difficult to train because of the problem of gradient diffusion and the risk of overfitting. Following Hinton and Salakhutdinov (2006), we employ a layer-wise training strategy to avoid the problems of overfitting and gradient vanishing. The main idea of layer-wise training is to train the network by adding the layers one by one. Specifically, we first train the neural network with the first hidden layer only. Then, we train the network with two hidden layers after the training with the first layer is done, and so on, until we reach the top hidden layer. When the network with layers 1 to l has converged, we preserve the current parameters as the initial values for training the network with layers 1 to l + 1.

4.2 Max-Margin Criterion

We use the max-margin criterion (Taskar et al., 2005) to train our model. Intuitively, the max-margin criterion provides an alternative to probabilistic, likelihood-based estimation methods by concentrating directly on the robustness of the decision boundary of a model. We use Y(x_i) to denote the set of all possible tag sequences for a given sentence x_i, and the correct tag sequence for x_i is y_i. The parameter set of our model is θ. We first define a structured margin loss ∆(y_i, ŷ) for predicting a tag sequence ŷ for a given correct tag sequence y_i:

∆(y_i, ŷ) = Σ_{j=1}^{n} η · 1{y_{i,j} ≠ ŷ_j},    (13)

where n is the length of sentence x_i and η is a discount parameter. The loss is proportional to the number of characters with an incorrect tag in the predicted tag sequence. For a given training instance (x_i, y_i), we search for the tag sequence with the highest score:

y* = argmax_{ŷ ∈ Y(x_i)} s(x_i, ŷ, θ),    (14)

where the tag sequence is found and scored by the proposed model via the function s(·) in Eq. (12). The objective of max-margin training is that the tag sequence with the highest score is the correct one, y* = y_i, and that its score is larger than the scores of other possible tag sequences ŷ ∈ Y(x_i) by a margin:

s(x_i, y_i, θ) ≥ s(x_i, ŷ, θ) + ∆(y_i, ŷ).    (15)

This leads to the regularized objective function for m training examples:
J(θ) = l (θ) + θ 2, (16) m i 2 ∥ ∥2 Following (Hinton and Salakhutdinov, 2006), i=1 we employ the layer-wise training strategy to avoid ∑ li(θ) = max (s(xi, y,ˆ θ)+∆(yi, yˆ)) s(xi, yi, θ). problems of overfitting and gradient vanishing. yˆ Y (xi) − The main idea of layer-wise training is to train the ∈ (17)

1748 By minimizing this object, the score of the correct Window size k = 5 tag sequence yi is increased and score of the high- Character embedding size d = 50 est scoring incorrect tag sequence yˆ is decreased. Initial learning rate α = 0.3 The objective function is not differentiable due to Margin loss discount η = 0.2 4 the hinge loss. We use a generalization of gradient Regularization λ = 10− descent called subgradient method (Ratliff et al., Dropout rate on input layer p = 20% 2007) which computes a gradient-like direction. Following (Socher et al., 2013a), we minimize Table 1: Hyper-parameter settings. the objective by the diagonal variant of AdaGrad (Duchi et al., 2011) with minibatchs. The parame- ter update for the i-th parameter θt,i at time step t is as follows: 96

α 1 layer θt,i = θt 1,i gt,i, (18) 94 − − t 2 2 layers τ=1 gτ,i 3 layers √ 4 layers ∑ 92 5 layers θi where α is the initial learning rate and gτ R| | F-value(%) layer-wise ∈ is the subgradient at time step τ for parameter θ . i 90 5 Experiments 88 We evaluate our model on two different kinds of 0 10 20 30 40 texts: newswire texts and micro-blog texts. For epoches evaluation, we use the standard Bakeoff scoring program to calculate precision, recall, F1-score. Figure 5: Performance of different models with or without layer-wise training strategy on PKU devel- 5.1 Word Segmentation on Newswire Texts opment set. 5.1.1 Datasets We use three popular datasets, PKU, MSRA and CTB6, to evaluate our model on newswire texts. find that it is a good balance between model per- The PKU and MSRA data are provided by the formance and efficiency to set character embed- second International Chinese Word Segmentation ding size d = 50. In fact, the larger embedding Bakeoff (Emerson, 2005), and CTB6 is from size leads to higher cost of computational resource, Chinese TreeBank 6.0 (LDC2007T36) (Xue et while lower dimensionality of the character em- al., 2005), which is a segmented, part-of-speech bedding seems to underfit according to the experi- tagged, and fully bracketed corpus in the con- ment results. stituency formalism. These datasets are commonly used by previous state-of-the-art models and neu- Deep neural networks contain multiple non- ral network models. In addition, we use the first linear hidden layers are always hard to train for it 90% sentences of the training data as training set is easy to overfit. Several methods have been used and the rest 10% sentences as development set for in neural models to avoid overfitting, such as early PKU and MSRA datasets, and we divide the train- stop and weight regularization. Dropout (Srivas- ing, development and test sets according to (Yang tava et al., 2014) is also one of the popular strate- and Xue, 2012) for the CTB6 dataset. gies to avoid overfitting when training the deep All datasets are preprocessed by replacing the neural networks. Hence, we utilize the dropout Chinese idioms and the continuous English char- strategy in this work. Specifically, dropout is to acters and digits with a unique flag. temporarily remove the neuron away with a fixed probability p independently, along with the incom- 5.1.2 Hyper-parameters ing and outgoing connections of it. As a result, We set the hyper-parameters of the model as list we find dropout on the input layer with probability in Table 1 via experiments on development set. p = 20% is a good tradeoff between model effi- In addition, we set the batch size to 20. And we ciency and performance.

1749 without layer-wise with layer-wise models P R F P R F GRNN (1 layer) 90.7 89.6 90.2 - - - GRNN (2 layers) 96.0 95.6 95.8 96.0 95.6 95.8 GRNN (3 layers) 95.9 95.4 95.7 96.0 95.7 95.9 GRNN (4 layers) 95.6 95.2 95.4 96.1 95.7 95.9 GRNN (5 layers) 95.3 94.7 95.0 96.1 95.7 95.9

Table 2: Performance of different models with or without layer-wise training strategy on PKU test set.

5.1.3 Layer-wise Training character embeddings instead of that with random We first investigate the effects of the layer-wise initialization. Thus, we pre-train the embeddings training strategy. Since we set the size of context on a huge unlabeled data, the Chinese Wikipedia window to five, there are five recursive layers in corpus, with word2vec toolkit (Mikolov et al., our architecture. And we train the networks with 2013). By using these obtained character embed- the different numbers of recursion layers. Due to dings, our model receives better performance and the limit of space, we just give the results on PKU still outperforms the previous neural models with dataset. pre-trained character embeddings. The detailed re- Figure 5 gives the convergence speeds of the sults are shown in Table 4 (1st to 3rd rows). five models with different numbers of layers and Inspired by (Pei et al., 2014), we utilize the bi- the model with layer-wise training strategy on de- gram feature embeddings in our model as well. velopment set of PKU dataset. The model with The concept of feature embedding is quite similar one layer just use the neurons of the lowest layer to that of character embedding mentioned above. in final linear score function. Since there are no Specifically, each context feature is represented as non-linear layer, its seems to underfit and perform a single vector called feature embedding. In this poorly. The model with two layers just use the paper, we only use the simply bigram feature em- neurons in the lowest two layers, and so on. The beddings initialized by the average of two embed- model with five layers use all the neurons in the dings of consecutive characters element-wisely. network. As we can see, the layer-wise training strategy lead to the fastest convergence and the Although the model of Pei et al. (2014) greatly best performance. benefits from the bigram feature embeddings, our Table 2 shows the performances on PKU test model just obtains a small improvement with them. set. The performance of the model with layer-wise This difference indicates that our model has well training strategy is always better than that with- modeled the combinations of the characters and do out layer-wise training strategy. With the increase not need much help of the feature engineering. The of the number of layers, the performance also in- detailed results are shown in Table 4 (4-th and 6-th creases and reaches the stable high performance rows). until getting to the top layer. Table 5 shows the comparisons of our model with the state-of-the-art systems on F-value. The 5.1.4 Results model proposed by Zhang and Clark (2007) is We first compare our model with the previous neu- a word-based segmentation method, which ex- ral approaches on PKU, MSRA and CTB6 datasets ploit features of complete words, while remains as showing in Table 3. The character embed- of the list are all character-based word segmenters, dings of the models are random initialized. The whose features are mostly extracted from the con- performance of word segmentation is significantly text characters. Moreover, some systems (such as boosted by exploiting the gated recursive archi- Sun and Xu (2011) and Zhang et al. (2013)) also tecture, which can better model the combinations exploit kinds of extra information such as the un- of the context characters than the previous neural labeled data or other knowledge. Although our models. 
model only uses simple bigram features, it outper- Previous works have proven it will greatly im- forms the previous state-of-the-art methods which prove the performance to exploit the pre-trained use more complex features.

models                 PKU                 MSRA                CTB6
                       P      R      F     P      R      F     P       R       F
(Zheng et al., 2013)   92.8   92.0   92.4  92.9   93.6   93.3  94.0*   93.1*   93.6*
(Pei et al., 2014)     93.7   93.4   93.5  94.6   94.2   94.4  94.4*   93.4*   93.9*
GRNN                   96.0   95.7   95.9  96.3   96.1   96.2  95.4    95.2    95.3

Table 3: Performances on PKU, MSRA and CTB6 test sets with randomly initialized character embeddings.

models                    PKU                 MSRA                CTB6
                          P      R      F     P      R      F     P       R       F
+Pre-train
(Zheng et al., 2013)      93.5   92.2   92.8  94.2   93.7   93.9  93.9*   93.4*   93.7*
(Pei et al., 2014)        94.4   93.6   94.0  95.2   94.6   94.9  94.2*   93.7*   94.0*
GRNN                      96.3   95.9   96.1  96.2   96.3   96.2  95.8    95.4    95.6
+bigram
GRNN                      96.6   96.2   96.4  97.5   97.3   97.4  95.9    95.7    95.8
+Pre-train+bigram
(Pei et al., 2014)        -      95.2   -     -      97.2   -     -       -       -
GRNN                      96.5   96.3   96.4  97.4   97.8   97.6  95.8    95.7    95.8

Table 4: Performances on PKU, MSRA and CTB6 test sets with pre-trained and bigram character embeddings.

models                    PKU    MSRA   CTB6
(Tseng et al., 2005)      95.0   96.4   -
(Zhang and Clark, 2007)   95.1   97.2   -
(Sun and Xu, 2011)        -      -      95.7
(Zhang et al., 2013)      96.1   97.4   -
This work                 96.4   97.6   95.8

Table 5: Comparison of GRNN with the state-of-the-art methods on PKU, MSRA and CTB6 test sets.

5.2 Word Segmentation on Micro-blog Texts

5.2.1 Dataset

We use the NLPCC 2015 dataset¹ (Qiu et al., 2015) to evaluate our model on micro-blog texts. The NLPCC 2015 data are provided by the shared task of the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2015): Chinese Word Segmentation and POS Tagging for Micro-blog Text. Unlike the popular newswire datasets, the NLPCC 2015 dataset is collected from Sina Weibo², and consists of relatively informal micro-blog texts on various topics, such as finance, sports, entertainment, and so on. The statistics of the dataset are shown in Table 6.

To train our model, we also use the first 90% of the sentences in the training data as the training set and the remaining 10% as the development set.

Here, we use the default setting of the CRF++ toolkit with the feature templates shown in Table 7. The same feature templates are also used for FNLP.

¹ http://nlp.fudan.edu.cn/nlpcc2015/
² http://www.weibo.com/
³ https://github.com/xpqiu/fnlp/
⁴ http://taku910.github.io/crfpp/
* The result is from our own implementation of the corresponding method.

Dataset    Sents    Words     Chars     Word Types   Char Types   OOV Rate
Training   10,000   215,027   347,984   28,208       3,971        -
Test       5,000    106,327   171,652   18,696       3,538        7.25%
Total      15,000   322,410   520,555   35,277       4,243        -

Table 6: Statistical information of NLPCC 2015 dataset.

unigram feature    c−2, c−1, c0, c+1, c+2
bigram feature     c−1 ◦ c0, c0 ◦ c+1
trigram feature    c−2 ◦ c−1 ◦ c0, c−1 ◦ c0 ◦ c+1, c0 ◦ c+1 ◦ c+2

Table 7: Templates of CRF++ and FNLP.

5.2.2 Results

Since the NLPCC 2015 dataset is a newly released dataset, we compare our model with two popular open source toolkits for the sequence labeling task: FNLP³ (Qiu et al., 2013) and CRF++⁴. Our model uses pre-trained and bigram character embeddings. Table 8 shows the comparison of our model with the other systems on the NLPCC 2015 dataset.

models      P      R      F
CRF++       93.3   93.2   93.3
FNLP        94.1   93.9   94.0
This work   94.7   94.8   94.8

Table 8: Performances on NLPCC 2015 dataset.

6 Related Work

Chinese word segmentation has been studied with considerable effort in the NLP community. The most popular word segmentation method is based on sequence labeling (Xue, 2003). Recently, researchers have tended to explore neural network based approaches (Collobert et al., 2011) to reduce the effort of feature engineering (Zheng et al., 2013; Qi et al., 2014). However, the features of all these methods are the concatenation of the embeddings of the context characters.

Pei et al. (2014) also used the neural tensor model (Socher et al., 2013b) to capture the complicated interactions between tags and context characters. But the interactions depend on the number of tensor slices, which cannot be too large due to the model complexity. The experiments also show that the model of Pei et al. (2014) greatly benefits from the further bigram feature embeddings, which indicates that their model cannot even handle the interactions of consecutive characters. Unlike them, our model obtains only a small improvement with the bigram feature embeddings, which indicates that our approach has already modeled the complicated combinations of the context characters well, and does not need much help from further feature engineering.

More recently, Cho et al. (2014a) also proposed a gated recursive convolutional neural network for the machine translation task, to deal with sentences of varying lengths. However, their approach only models the update gate and, lacking a reset gate, cannot tell whether the information comes from the current state or from the sub-nodes during the update stage. Instead, our approach models two kinds of gates, the reset gate and the update gate, with which we can better model the combinations of context characters via the selection function of the reset gate and the collection function of the update gate.

7 Conclusion

In this paper, we propose a gated recursive neural network (GRNN) to explicitly model the combinations of the characters for the Chinese word segmentation task. Each neuron in GRNN can be regarded as a different combination of the input characters. Thus, the whole GRNN has the ability to simulate the design of the sophisticated features in traditional methods. Experiments show that our proposed model outperforms the state-of-the-art methods on three popular benchmark datasets.

Despite Chinese word segmentation being a specific case, our model can be easily generalized and applied to other sequence labeling tasks. In future work, we would like to investigate our proposed GRNN on other sequence labeling tasks.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments. This work was partially funded by the National Natural Science Foundation of China (61472088, 61473092), the National High Technology Research and Development Program of China (2015AA015408), Shanghai Science and Technology Development Funds (14ZR1403200), and Shanghai Leading Academic Discipline Project (B114).

References

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of the Workshop on Syntax, Semantics and Structure in Statistical Translation.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159.

T. Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, pages 123–133. Jeju Island, Korea.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Wenzhe Pei, Tao Ge, and Chang Baobao. 2014. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of ACL.

Jordan B. Pollack. 1990. Recursive distributed representations. Artificial Intelligence, 46(1):77–105.

Yanjun Qi, Sujatha G. Das, Ronan Collobert, and Jason Weston. 2014. Deep learning for character-based information extraction. In Advances in Information Retrieval, pages 668–674. Springer.

Xipeng Qiu, Qi Zhang, and Xuanjing Huang. 2013. FudanNLP: A toolkit for Chinese natural language processing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Xipeng Qiu, Peng Qian, Liusong Yin, and Xuanjing Huang. 2015. Overview of the NLPCC 2015 shared task: Chinese word segmentation and POS tagging for micro-blog texts. arXiv preprint arXiv:1505.07599.

Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. 2007. (Online) subgradient methods for structured prediction. In Eleventh International Conference on Artificial Intelligence and Statistics (AIStats).

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional vector grammars. In Proceedings of the ACL Conference.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013b. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Weiwei Sun and Jia Xu. 2011. Enhancing Chinese word segmentation using unlabeled data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 970–979. Association for Computational Linguistics.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Proceedings of the International Conference on Machine Learning.

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for SIGHAN bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, volume 171.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

N. Xue. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.

Yaqin Yang and Nianwen Xue. 2012. Chinese comma disambiguation for discourse analysis. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 786–794. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2007. Chinese segmentation with a word-based perceptron algorithm. In ACL.

Longkai Zhang, Houfeng Wang, Xu Sun, and Mairgup Mansur. 2013. Exploring representations from unlabeled data with co-training for Chinese word segmentation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. 2013. Deep learning for Chinese word segmentation and POS tagging. In EMNLP, pages 647–657.
