Exploiting Chunk-level Features to Improve Phrase Chunking

Junsheng Zhou, Weiguang Qu, Fen Zhang
Jiangsu Research Center of Information Security & Privacy Technology
School of Computer Science and Technology, Nanjing Normal University, Nanjing, China, 210046
Email: {zhoujs,wgqu}@njnu.edu.cn, [email protected]

Abstract

Most existing systems solve the phrase chunking task with sequence labeling approaches, in which the chunk candidates cannot be treated as a whole during the parsing process, so that chunk-level features cannot be exploited in a natural way. In this paper, we formulate phrase chunking as a joint segmentation and labeling task. We propose an efficient dynamic programming algorithm with pruning for decoding, which allows the direct use of features describing the internal characteristics of a chunk and features capturing the correlations between adjacent chunks. A relaxed, online maximum margin training algorithm is used for learning. Within this framework, we explored a variety of effective feature representations for Chinese phrase chunking. The experimental results show that the use of chunk-level features can lead to significant performance improvement, and that our approach achieves state-of-the-art performance. In particular, our approach is much better at recognizing long and complicated phrases.

1 Introduction

Phrase chunking is a Natural Language Processing task that consists in dividing a text into syntactically correlated parts of words. These phrases are non-overlapping, i.e., a word can only be a member of one chunk (Abney, 1991). Generally speaking, there are two phrase chunking tasks: text chunking (shallow parsing) and noun phrase (NP) chunking. Phrase chunking provides a key feature that helps on more elaborated NLP tasks such as semantic role tagging.

There is a wide range of research work on phrase chunking based on machine learning approaches. However, most of the previous work reduced phrase chunking to sequence labeling problems, either by using classification models, such as SVMs (Kudo and Matsumoto, 2001) and Winnow and voted-perceptrons (Zhang et al., 2002; Collins, 2002), or by using sequence labeling models, such as Hidden Markov Models (HMMs) (Molina and Pla, 2002) and Conditional Random Fields (CRFs) (Sha and Pereira, 2003). When applying the sequence labeling approaches to phrase chunking, there exist two major problems.

Firstly, these models cannot treat a sequence of continuous words globally as a chunk candidate, and thus cannot inspect the internal structure of the candidate, which is an important source of information in modeling phrase chunking. In particular, it makes impossible the use of local indicator function features of the type "the chunk consists of POS tag sequence p1, ..., pk". For example, the Chinese NP "农业/NN(agriculture) 生产/NN(production) 和/CC(and) 农村/NN(rural) 经济/NN(economic) 发展/NN(development)" seems relatively difficult to recognize correctly with a sequence labeling approach due to its length. But if we can treat the sequence of words as a whole and describe the formation pattern of the POS tags of this chunk with a regular expression-like form "[NN]+[CC][NN]+", then it is more likely to be correctly recognized, since this pattern better expresses the characteristics of its constituents. As another example, consider the recognition of special terms. In the Chinese corpus, there exists a kind of NP called special terms, such as "『 生命(Life) 禁区(Forbidden Zone) 』", which are bracketed with particular punctuation marks like "『, 』, 「, 」, 《, 》". When recognizing special terms, it is difficult for the sequence labeling approaches to guarantee the matching of the particular punctuation marks appearing at the starting and ending positions of a chunk. For instance, the chunk candidate "『 生命(Life) 禁区(Forbidden Zone)" should be considered an invalid chunk. But it is easy to check this kind of punctuation matching within a single chunk by introducing a chunk-level feature.
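To make these two chunk-level indicator features concrete, the following is a minimal sketch, not the authors' implementation; the helper names pos_pattern and brackets_matched, and the bracket table, are our own assumptions for illustration. It shows how a candidate chunk, viewed as a whole, can be collapsed into a POS formation pattern and checked for balanced special-term punctuation.

```python
# Pairs of special-term brackets mentioned above (assumed table for illustration).
BRACKET_PAIRS = {"『": "』", "「": "」", "《": "》"}

def pos_pattern(chunk_pos):
    """Collapse a chunk's POS tag sequence into a regular-expression-like
    formation pattern, e.g. [NN, NN, CC, NN, NN, NN] -> "[NN]+[CC][NN]+"."""
    pattern = []
    for tag in chunk_pos:
        if pattern and pattern[-1][0] == tag:
            pattern[-1][1] = "+"          # repeat of the previous tag
        else:
            pattern.append([tag, ""])
    return "".join("[%s]%s" % (tag, rep) for tag, rep in pattern)

def brackets_matched(chunk_words):
    """Chunk-level check that special-term punctuation opened inside the
    chunk is also closed inside it."""
    stack = []
    for w in chunk_words:
        if w in BRACKET_PAIRS:
            stack.append(BRACKET_PAIRS[w])
        elif w in BRACKET_PAIRS.values():
            if not stack or stack.pop() != w:
                return False
    return not stack

# The NP from the text: 农业 生产 和 农村 经济 发展
print(pos_pattern(["NN", "NN", "CC", "NN", "NN", "NN"]))   # [NN]+[CC][NN]+
print(brackets_matched(["『", "生命", "禁区", "』"]))         # True
print(brackets_matched(["『", "生命", "禁区"]))               # False
```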



Secondly, the sequence labeling models cannot capture the correlations between adjacent chunks, which should be informative for the identification of chunk boundaries and types. In particular, we find that some headwords in a sentence are expected to have a stronger dependency relation with the headwords of the preceding chunks than with their immediately preceding words within the same chunk. For example, in the following sentence:

"[双方/PN(Bilateral)]_NP [经贸/NN(economic and trade) 关系/NN(relations)]_NP [正/AD(just) 稳步/AD(steadily) 发展/VV(develop)]_VP"

if we can find the three headwords "双方", "关系" and "发展" located in the three adjacent chunks with some head-finding rules, then the headword dependencies expressed by headword bigrams should be helpful for recognizing the chunks in this sentence.

In summary, the inherent deficiency in applying the sequence labeling approaches to phrase chunking is that the chunk-level features one would expect to be very informative cannot be exploited in a natural way.

In this paper, we formulate phrase chunking as a joint segmentation and labeling problem, which offers advantages over previous learning methods by providing a natural formulation in which to exploit features describing the internal structure of a chunk and features capturing the correlations between adjacent chunks.

Within this framework, we explored a variety of effective feature representations for Chinese phrase chunking. The experimental results on a Chinese chunking corpus as well as an English chunking corpus show that the use of chunk-level features can lead to significant performance improvement, and that our approach performs better than other approaches based on the sequence labeling models.
2 Related Work

In recent years, many chunking systems based on machine learning approaches have been presented. Some approaches rely on k-order generative probabilistic models, such as HMMs (Molina and Pla, 2002). However, HMMs learn a generative model over input sequence and labeled sequence pairs, and have difficulties in modeling multiple non-independent features of the observation sequence. To accommodate multiple overlapping features on observations, some other approaches view phrase chunking as a sequence of classification problems, using support vector machines (SVMs) (Kudo and Matsumoto, 2001) and a variety of other classifiers (Zhang et al., 2002). Since these classifiers cannot trade off decisions at different positions against each other, the best classifier-based shallow parsers are forced to resort to heuristic combinations of multiple classifiers. Recently, CRFs were widely employed for phrase chunking, and presented comparable or better performance than other state-of-the-art models (Sha and Pereira, 2003; McDonald et al., 2005). Further, Sun et al. (2008) used latent-dynamic conditional random fields (LDCRF) to explicitly learn the hidden substructure of shallow phrases, achieving state-of-the-art performance on the NP-chunking task on the CoNLL data.

Some similar approaches based on classifiers or sequence labeling models were also used for Chinese chunking (Li et al., 2003; Tan et al., 2004; Tan et al., 2005). Chen et al. (2006) conducted an empirical study of Chinese chunking on a corpus extracted from the UPENN Chinese Treebank-4 (CTB4). They compared the performances of the state-of-the-art machine learning models for Chinese chunking, and proposed some Tag-Extension and novel voting methods to improve performance.

In this paper, we model phrase chunking with a joint segmentation and labeling approach, which offers advantages over previous learning methods by explicitly incorporating the internal structural features and the correlations between adjacent chunks. To some extent, our model is similar to Semi-Markov Conditional Random Fields (Semi-CRFs), in which the segmentation and labeling can also be done directly (Sarawagi and Cohen, 2004). However, Semi-CRFs just model label dependency, and cannot capture richer correlations between adjacent chunks, as is done in our approach. This limitation of Semi-CRFs leads to their relatively low performance.


3 Problem Formulation

3.1 Chunk Types

Unlike English chunking, there is no benchmarking corpus for Chinese chunking. We follow the studies in (Chen et al., 2006) so that a more direct comparison with state-of-the-art systems for Chinese chunking is possible. There are 12 types of chunks in the chunking corpus: ADJP, ADVP, CLP, DNP, DP, DVP, LCP, LST, NP, PP, QP and VP (Xue et al., 2000). The training and test corpora can be extracted from CTB4 with a public tool, as described in (Chen et al., 2006).

3.2 Sequence Labeling Approaches to Phrase Chunking

The standard approach to phrase chunking is to use tagging techniques with a BIO tag set. Words in the input text are tagged with B for the beginning of a contiguous segment, I for the inside of a contiguous segment, or O for outside any segment. For instance, the sentence (word-segmented and POS-tagged) "他/NR(He) 到达/VV(reached) 北京/NR(Beijing) 机场/NN(airport) 。/PU" will be tagged as follows:

Example 1:
S1: [NP 他] [VP 到达] [NP 北京/机场] [O 。]
S2: 他/B-NP 到达/B-VP 北京/B-NP 机场/I-NP 。/O

Here S1 denotes that the sentence is tagged with chunk types, and S2 denotes that the sentence is tagged with chunk tags under the BIO-based representation. With a data representation like S2, the problem of phrase chunking is reduced to a sequence labeling task.
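As a small illustration of the two representations in Example 1, the sketch below (our own illustration, not part of the paper's system) converts a chunk-bracketed sentence such as S1 into the BIO-tagged form S2.

```python
def chunks_to_bio(chunks):
    """chunks: list of (chunk_type, [words]); type None marks material
    outside any chunk. Returns a list of (word, BIO-tag) pairs."""
    tagged = []
    for chunk_type, words in chunks:
        for i, w in enumerate(words):
            if chunk_type is None:
                tagged.append((w, "O"))
            elif i == 0:
                tagged.append((w, "B-" + chunk_type))
            else:
                tagged.append((w, "I-" + chunk_type))
    return tagged

# S1 from Example 1: [NP 他] [VP 到达] [NP 北京/机场] [O 。]
s1 = [("NP", ["他"]), ("VP", ["到达"]), ("NP", ["北京", "机场"]), (None, ["。"])]
print(chunks_to_bio(s1))
# [('他', 'B-NP'), ('到达', 'B-VP'), ('北京', 'B-NP'), ('机场', 'I-NP'), ('。', 'O')]
```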
3.3 Phrase Chunking via a Joint Segmentation and Labeling Approach

To tackle the problems with the sequence labeling approaches to phrase chunking, we formulate it as a joint problem, which maps a Chinese sentence x with segmented words and POS tags to an output y with tagged chunk types, like S1 in Example 1. The joint model considers all possible chunk boundaries and corresponding chunk types in the sentence, and chooses the overall best output. This kind of parser reads the input sentence from left to right and predicts whether the current segment of continuous words is a chunk of some type. After one chunk is found, the parser moves on and searches for the next possible chunk.

Given a sentence x, let y denote an output tagged with chunk types, and GEN a function that enumerates a set of segmentation and labeling candidates GEN(x) for x. A parser is to solve the following "argmax" problem:

$$\hat{y} = \arg\max_{y \in GEN(x)} w^{T} \cdot \Phi(y) = \arg\max_{y \in GEN(x)} \sum_{i=1}^{|y|} w^{T} \cdot f(y_{[1..i]}) \qquad (1)$$

where $\Phi$ and $f$ are the global and local feature maps and $w$ is the parameter vector to learn. The inner product $w^{T} \cdot f(y_{[1..i]})$ can be seen as the confidence score of whether $y_i$ is a chunk. The parser takes into account the confidence score of each chunk, using the sum of the local scores as its criterion. A Markov assumption is necessary for tractable computation, so $f$ is usually defined on a limited history.

The main advantage of the joint segmentation and labeling approach to phrase chunking is that it allows integrating both the internal structural features and the correlations between adjacent chunks for prediction. The two basic components of our model are the decoding and learning algorithms, which are described in the following sections.
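The following sketch (ours, with hypothetical feature and weight containers, not the authors' code) spells out how the global score in Eq. (1) decomposes into a sum of local chunk scores under the first-order Markov assumption: each chunk is scored from features of the chunk itself and of the previous chunk only.

```python
from collections import defaultdict

def local_features(prev_chunk, prev_type, chunk, chunk_type):
    """Local feature map f: features of the current chunk plus features
    linking it to the preceding chunk (first-order Markov assumption).
    The concrete templates are listed in Table 1; these are placeholders."""
    words, pos_tags = zip(*chunk)
    feats = ["type=%s|words=%s" % (chunk_type, "_".join(words)),
             "type=%s|pos=%s" % (chunk_type, "_".join(pos_tags))]
    if prev_chunk is not None:
        feats.append("typebigram=%s_%s" % (prev_type, chunk_type))
    return feats

def score(weights, y):
    """Global score of Eq. (1): sum of w . f over the chunks of output y.
    y is a list of (chunk, chunk_type), a chunk being a list of (word, POS)."""
    total = 0.0
    prev_chunk, prev_type = None, None
    for chunk, chunk_type in y:
        for feat in local_features(prev_chunk, prev_type, chunk, chunk_type):
            total += weights[feat]
        prev_chunk, prev_type = chunk, chunk_type
    return total

weights = defaultdict(float)   # to be learned by the online algorithm in Section 5
```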


4 Decoding

The inference technique is one of the most important components of a joint segmentation and labeling model. In this section, we propose a dynamic programming algorithm with pruning to efficiently produce the optimal output.

4.1 Algorithm Description

Given an input sentence x, the decoding algorithm searches for the highest-scored output with recognized chunks. The search space of combined candidates in the joint segmentation and labeling task is very large: the number of possible candidates grows exponentially with sentence size, at a rate of $O(2^{n}T^{n})$ for the joint system, where n is the length of the sentence and T is the number of chunk types.

It is natural to use greedy heuristic search algorithms for inference in similar joint problems (Zhang and Clark, 2008; Zhang and Clark, 2010). However, greedy heuristic search algorithms only explore a fraction of the whole space (even with beam search), as opposed to dynamic programming. Additionally, a specific advantage of the dynamic programming algorithm is that constraints required for a valid prediction sequence can be handled in a principled way. We show that dynamic programming is in fact possible for this joint problem, by introducing some effective pruning schemes.

To make the inference tractable, we first make a first-order Markov assumption on the features used in our model. In other words, we assume that the chunk $c_i$ and the corresponding label $t_i$ are only associated with the preceding chunk $c_{i-1}$ and label $t_{i-1}$. Suppose that the input sentence has n words and the constant M is the maximum chunk length in the training corpus. Let V(b,e,t) denote the highest-scored segmentation and labeling with the last chunk starting at word index b, ending at word index e, and the last chunk type being t. One way to find the highest-scored segmentation and labeling for the input sentence is to first calculate V(b,n-1,t) for all possible start positions b ∈ (n-M)..n-1 and all possible chunk types t, and then pick the highest-scored one among these candidates. In order to compute V(b,n-1,t), the last chunk needs to be combined with all possible segmentations of words (b-M)..b-1 and all possible chunk types so that the highest-scored one can be selected. According to the principle of optimality, the highest-scored among the segmentations of words (b-M)..b-1 with the last chunk being words b′..b-1 and the last chunk type being t′ will also give the highest score when combined with words b..n-1 and tag t. In this way, the search task is decomposed recursively into smaller subproblems, where in the base case the subproblems V(0,e,t), for e ∈ 0..M-1 and each possible chunk type t, are solved in a straightforward manner. The final highest-scored segmentation and labeling can then be found by solving all subproblems in a bottom-up fashion.

The pseudocode for this algorithm is shown in Figure 1. It works by filling an n by n by T table chart, where n is the number of words in the input sentence sent, and T is the number of chunk types. chart[b,e,t] records the value of subproblem V(b,e,t). chart[0,e,t] can be computed directly for e = 0..M-1 and for chunk type t = 1..T. The final output is the best among chart[b,n-1,t], with b = n-M..n-1 and t = 1..T.

Inputs: sentence sent (word segmented and POS tagged)
Variables:
  word index b for the start of the chunk;
  word index e for the end of the chunk;
  word index p for the start of the previous chunk;
  chunk type index t for the current chunk;
  chunk type index t′ for the previous chunk.
Initialization:
  for e = 0..M-1:
    for t = 1..T:
      chart[0,e,t] ← single chunk sent[0,e] with type t
Algorithm:
  for e = 0..n-1:
    for b = (e-M)..e:
      for t = 1..T:
        chart[b,e,t] ← the highest-scored segmentation and labeling among those derived by combining chart[p,b-1,t′] with sent[b,e] and chunk type t, for p = (b-M)..b-1, t′ = 1..T
Outputs: the highest-scored segmentation and labeling among chart[b,n-1,t], for b = n-M..n-1, t = 1..T

Figure 1: A dynamic-programming algorithm for phrase chunking.
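A runnable sketch of the chart-filling procedure in Figure 1 is given below. It is our own simplification: scores come from a user-supplied chunk-scoring function that, for brevity, sees only the previous chunk type rather than the full previous chunk, back-pointers are kept so the best output can be recovered, and the pruning of Section 4.2 is omitted.

```python
def decode(sent, chunk_types, M, score):
    """sent: non-empty list of (word, POS); chunk_types: list of type names;
    M: maximum chunk length; score(sent, b, e, t, prev_t): local score of
    making sent[b..e] a chunk of type t after a chunk of type prev_t
    (prev_t is None for the first chunk). Returns (best_score, chunks)."""
    n = len(sent)
    chart = {}   # (b, e, t) -> best score of a prefix ending in chunk sent[b..e] of type t
    back = {}    # (b, e, t) -> previous chart entry, or None

    for e in range(min(M, n)):                       # base case: first chunk is sent[0..e]
        for t in chunk_types:
            chart[0, e, t] = score(sent, 0, e, t, None)
            back[0, e, t] = None

    for e in range(n):                               # recursive case
        for b in range(max(1, e - M + 1), e + 1):    # chunk sent[b..e], length <= M
            for t in chunk_types:
                best = None
                for p in range(max(0, b - M), b):    # previous chunk sent[p..b-1]
                    for pt in chunk_types:
                        s = chart[p, b - 1, pt] + score(sent, b, e, t, pt)
                        if best is None or s > best:
                            best = s
                            chart[b, e, t] = s
                            back[b, e, t] = (p, b - 1, pt)

    # pick the best complete analysis and follow back-pointers
    end = max(((b, n - 1, t) for b in range(max(0, n - M), n) for t in chunk_types
               if (b, n - 1, t) in chart), key=lambda k: chart[k])
    chunks, key = [], end
    while key is not None:
        b, e, t = key
        chunks.append((t, [w for w, _ in sent[b:e + 1]]))
        key = back[key]
    return chart[end], list(reversed(chunks))
```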
4.2 Pruning

The time complexity of the above algorithm is $O(M^{2}T^{2}n)$, where M is the maximum chunk size. It is linear in the length of the sentence, but the constant factor is relatively large. In practice, the search space contains a large number of invalid partial candidates, which makes the algorithm slow. In this section we describe three partial output pruning schemes which are helpful in speeding up the algorithm.

Firstly, we collect chunk type transition information by observing every pair of adjacent chunks in the training corpus, and record it in a chunk type transition matrix. For example, in the Chinese Treebank that we used for our experiments, a transition from chunk type ADJP to ADVP does not occur in the training corpus, so the corresponding matrix element is set to false; otherwise it is set to true. During decoding, the chunk type transition information is used to prune unlikely combinations of the current chunk and the preceding chunk based on their chunk types.

Secondly, a POS tag dictionary is used to record the POS tags associated with each chunk type. Specifically, for each chunk type, we record all POS tags appearing in this type of chunk in the training corpus. During decoding, only a segment of continuous words containing solely the POS tags allowed by the POS tag dictionary is considered a valid chunk candidate.

Finally, the system records the maximum number of words for each type of chunk in the training corpus. For example, in the Chinese Treebank, most types of chunks have one to three words; the few chunk types seen with length greater than ten are NP, QP and ADJP. During decoding, a chunk candidate whose length is greater than the maximum chunk length associated with its chunk type is discarded.

Development tests show that these pruning schemes improve the speed significantly, while having a very small negative influence on accuracy.
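A minimal sketch of the three pruning tables, collected from a training corpus of chunk-annotated sentences (the data structures and function names are our own; the real system's bookkeeping may differ):

```python
from collections import defaultdict

def collect_pruning_tables(corpus):
    """corpus: iterable of sentences, each a list of (chunk_type, [(word, pos), ...]).
    Returns the chunk-type transition set, the per-type POS dictionary and
    the per-type maximum chunk length used to prune the decoder's search space."""
    allowed_transition = set()            # (prev_type, type) pairs seen in training
    pos_dict = defaultdict(set)           # chunk type -> POS tags seen inside it
    max_len = defaultdict(int)            # chunk type -> maximum observed length
    for sentence in corpus:
        prev_type = None
        for chunk_type, words in sentence:
            if prev_type is not None:
                allowed_transition.add((prev_type, chunk_type))
            for _, pos in words:
                pos_dict[chunk_type].add(pos)
            max_len[chunk_type] = max(max_len[chunk_type], len(words))
            prev_type = chunk_type
    return allowed_transition, pos_dict, max_len

def valid_candidate(chunk, chunk_type, prev_type, allowed_transition, pos_dict, max_len):
    """Cheap checks applied to a candidate chunk before scoring it during decoding."""
    if prev_type is not None and (prev_type, chunk_type) not in allowed_transition:
        return False
    if len(chunk) > max_len[chunk_type]:
        return False
    return all(pos in pos_dict[chunk_type] for _, pos in chunk)
```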


5 Learning

5.1 Discriminative Online Training

By defining features, a candidate output y is mapped into a global feature vector, in which each dimension represents the count of a particular feature in the sentence. The learning task is to set the parameter values w using the training examples as evidence.

Online learning is an attractive method for the joint model since it converges quickly, within a few iterations (McDonald, 2006). We focus on an online learning algorithm called MIRA, a relaxed, online maximum margin training algorithm with the desired accuracy and scalability properties (Crammer, 2004). Furthermore, MIRA is very flexible with respect to the loss function: any loss function on the output is compatible with MIRA, since it does not require the loss to factor according to the output, which enables our model to be optimized with respect to evaluation metrics directly. Figure 2 outlines the generic online learning algorithm (McDonald, 2006) used in our framework.

MIRA updates the parameter vector w under two constraints: (1) the positive example must have a higher score by a given margin, and (2) the change to w should be minimal. The second constraint is to reduce fluctuations in w. In particular, we use a generalized version of MIRA (Crammer et al., 2005; McDonald, 2006) that can incorporate k-best decoding in the update procedure.

Input: training set $S = \{(x_t, y_t)\}_{t=1}^{T}$
1: w^(0) = 0; v = 0; i = 0
2: for iter = 1 to N do
3:   for t = 1 to T do
4:     w^(i+1) = update w^(i) according to (x_t, y_t)
5:     v = v + w^(i+1)
6:     i = i + 1
7:   end for
8: end for
9: w = v / (N × T)
Output: weight vector w

Figure 2: Generic online learning algorithm.

In each iteration, MIRA updates the weight vector w by keeping the norm of the change in the weight vector as small as possible. Within this framework, we can formulate the optimization problem as follows (McDonald, 2006):

$$w^{(i+1)} = \arg\min_{w} \left\| w - w^{(i)} \right\| \quad \text{s.t.} \quad \forall y' \in \mathrm{best}_k(x_t; w^{(i)}):\; w^{T}\Phi(y_t) - w^{T}\Phi(y') \ge L(y_t, y') \qquad (2)$$

where $\mathrm{best}_k(x_t; w^{(i)})$ represents the set of top k-best outputs for $x_t$ given the weight vector $w^{(i)}$. In our implementation, the top k-best outputs are obtained with a straightforward k-best extension of the decoding algorithm in Section 4.1.
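For intuition, here is a simplified single-constraint (1-best) version of the update in Eq. (2); it has a closed-form solution, whereas the k-best version used in the paper solves a small quadratic program (e.g. with Hildreth's algorithm, as noted below). Feature vectors are sparse dicts; this is our own sketch, not the authors' code.

```python
def mira_update(w, gold_feats, pred_feats, loss):
    """One relaxed max-margin update: enforce
    score(gold) - score(pred) >= loss with the minimal change to w."""
    diff = dict(gold_feats)                       # Phi(y_t) - Phi(y')
    for f, v in pred_feats.items():
        diff[f] = diff.get(f, 0.0) - v
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return
    margin = sum(w.get(f, 0.0) * v for f, v in diff.items())
    tau = max(0.0, (loss - margin) / norm_sq)     # Lagrange multiplier of the single constraint
    for f, v in diff.items():
        w[f] = w.get(f, 0.0) + tau * v
```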
The above quadratic programming (QP) problem can be solved using Hildreth's algorithm (Censor and Zenios, 1997). Using Eq. (2) as the update in line 4 of the algorithm in Figure 2, we obtain k-best MIRA.

As shown in (McDonald, 2006), parameter averaging can effectively avoid overfitting. The final weight vector w is the average of the weight vectors after each iteration.


5.2 Loss Function

For the joint segmentation and labeling task, there are two alternative loss functions: 0-1 loss and F1 loss. 0-1 loss gives credit only when the entire output sequence is correct: there is no notion of a partially correct solution. The most common loss function for joint segmentation and labeling problems is the F1 measure over chunks. This is the harmonic mean of precision and recall over the (properly-labeled) chunk identification task, defined as follows:

$$L(y, y') = 1 - F_1 = 1 - \frac{2\,|y \cap y'|}{|y| + |y'|} \qquad (3)$$

where the cardinality of y is simply the number of chunks identified, and the cardinality of the intersection is the number of chunks the two outputs have in common. As can be seen from the definition, one is penalized both for identifying too many chunks (penalty in the denominator) and for identifying too few (penalty in the numerator). In our experiments, we will compare the performance of systems trained with the two loss functions.
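A small sketch of the F1 loss of Eq. (3), with each output represented as a set of (start, end, type) chunks (our own representation, chosen so that a chunk only counts as correct if both its boundaries and its type match):

```python
def f1_loss(gold_chunks, pred_chunks):
    """Eq. (3): 1 - F1 over two sets of (start, end, type) triples."""
    if not gold_chunks and not pred_chunks:
        return 0.0
    common = len(gold_chunks & pred_chunks)
    return 1.0 - 2.0 * common / (len(gold_chunks) + len(pred_chunks))

gold = {(0, 0, "NP"), (1, 1, "VP"), (2, 3, "NP")}
pred = {(0, 0, "NP"), (1, 1, "VP"), (2, 2, "NP"), (3, 3, "NP")}
print(round(f1_loss(gold, pred), 3))   # 1 - 2*2/(3+4) = 0.429
```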

5.3 Features

Table 1 shows the feature templates for the joint segmentation and labeling model. In the feature templates, c, t, w and p are used to represent a chunk, a chunk type, a word and a POS tag, respectively; c0 and c-1 represent the current chunk and the previous chunk, respectively. Similarly, w-1, w0 and w1 represent the previous word, the current word and the next word, respectively.

Although it is slightly less natural to do so, part of the features used in the sequence labeling models can also be represented in our approach. Therefore, the features employed in our model can be divided into three types: features similar to those used in the sequence labeling models (called SL-type features), features describing the internal structure of a chunk (called Internal-type features), and features capturing the correlations between adjacent chunks (called Correlation-type features).

Firstly, some features associated with a single label (here, the labels "B" and "I") used in the sequence labeling models are also represented in our model. In Table 1, templates 1-4 are the SL-type features, where label(w) denotes the label indicating the position of the word w in the current chunk, and len(c) denotes the length of chunk c. For example, given the NP chunk "北京(Beijing) 机场(Airport)", which includes two words, the value of label("北京") is "B" and the value of label("机场") is "I". bigram(w) denotes the word bigram formed by combining the word to the left of w and the one to the right of w, and biPOS(w) is defined analogously over POS tags. Template specitermMatch(c) checks the punctuation matching within chunk c for the special terms, as illustrated in Section 1.

Secondly, in our model, we have the chance to treat the chunk candidate as a whole during decoding, which means that we can employ more expressive features than in the sequence labeling models. In Table 1, templates 5-13 are the Internal-type features, where start_word(c) and end_word(c) represent the first word and the last word of chunk c, respectively. Similarly, start_POS(c) and end_POS(c) represent the POS tags associated with the first word and the last word of chunk c, respectively. These features aim at expressing the formation patterns of the current chunk with respect to words and POS tags. Template internalWords(c) denotes the concatenation of the words in chunk c, while internalPOSs(c) denotes the sequence of POS tags in chunk c in a regular expression-like form, as illustrated in Section 1.

Finally, in Table 1, templates 14-28 are the Correlation-type features, where head(c) denotes the headword extracted from chunk c, and headPOS(c) denotes the POS tag associated with the headword of chunk c. These features take into account various aspects of the correlations between adjacent chunks. For example, we extract the headwords located in adjacent chunks to form headword bigrams expressing the semantic dependency between adjacent chunks. To find the headword within every chunk, we referred to the head-finding rules from (Bikel, 2004), and made a simple modification to them. For instance, the head-finding rule for NP in (Bikel, 2004) is as follows:

(NP (r NP NN NT NR QP) (r))

Since the phrases are non-overlapping in our task, we simply remove the overlapping phrase tags NP and QP from the rule, and the rule is modified as follows:

(NP (r NN NT NR) (r))
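A sketch of how such a head-finding rule might be applied to a chunk (our own reading of the rule, not the authors' implementation: direction "r" means the chunk is scanned from the right for a word whose POS tag is in the rule's tag list, falling back to the rightmost word):

```python
# Head-finding table in the spirit of the modified rule above; only NP is filled in.
HEAD_RULES = {"NP": ("r", ["NN", "NT", "NR"])}

def find_head(chunk, chunk_type):
    """chunk: list of (word, POS). Returns the headword of the chunk under
    the scan-from-the-right reading described in the lead-in."""
    direction, tags = HEAD_RULES.get(chunk_type, ("r", []))
    order = reversed(chunk) if direction == "r" else chunk
    for word, pos in order:
        if pos in tags:
            return word
    return chunk[-1][0] if direction == "r" else chunk[0][0]

np_chunk = [("经贸", "NN"), ("关系", "NN")]
print(find_head(np_chunk, "NP"))   # 关系
```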


Additionally, the different bigrams formed by combining the first word (or POS tag) and the last word (or POS tag) of two adjacent chunks can also capture some correlations between adjacent chunks, and templates 17-22 are designed to express this kind of bigram information.

ID  Feature template
1   w label(w) t0, for all w in c0
2   bigram(w) label(w) t0, for all w in c0
3   biPOS(w) label(w) t0, for all w in c0
4   w-1 w1 label(w0) t0, where len(c0)=1
5   start_word(c0) t0
6   start_POS(c0) t0
7   end_word(c0) t0
8   end_POS(c0) t0
9   w end_word(c0) t0, where w ∈ c0 and w ≠ end_word(c0)
10  p end_POS(c0) t0, where p ∈ c0 and p ≠ end_POS(c0)
11  internalPOSs(c0) t0
12  internalWords(c0) t0
13  specitermMatch(c0)
14  t-1 t0
15  head(c-1) t-1 head(c0) t0
16  headPOS(c-1) t-1 headPOS(c0) t0
17  end_word(c-1) t-1 start_word(c0) t0
18  end_POS(c-1) t-1 start_POS(c0) t0
19  end_word(c-1) t-1 end_word(c0) t0
20  end_POS(c-1) t-1 end_POS(c0) t0
21  start_word(c-1) t-1 start_word(c0) t0
22  start_POS(c-1) t-1 start_POS(c0) t0
23  end_word(c-1) t0
24  end_POS(c-1) t0
25  t-1 t0 start_word(c0)
26  t-1 t0 start_POS(c0)
27  internalWords(c-1) t-1 internalWords(c0) t0
28  internalPOSs(c-1) t-1 internalPOSs(c0) t0

Table 1: Feature templates.
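To illustrate how a few of these templates instantiate, here is a sketch (ours; only a handful of the 28 templates, and the string encoding of the features is our own) that generates feature strings for a current chunk given the previous chunk:

```python
def chunk_features(prev_chunk, prev_type, chunk, chunk_type):
    """Instantiate a few of the templates in Table 1 for the current chunk c0
    (a list of (word, POS)) and the previous chunk c-1."""
    words = [w for w, _ in chunk]
    tags = [p for _, p in chunk]
    feats = [
        "T5:%s|%s" % (words[0], chunk_type),            # start_word(c0) t0
        "T7:%s|%s" % (words[-1], chunk_type),           # end_word(c0) t0
        "T11:%s|%s" % ("_".join(tags), chunk_type),     # internalPOSs(c0) t0 (raw tag
                                                        # sequence, not the collapsed form)
        "T12:%s|%s" % ("_".join(words), chunk_type),    # internalWords(c0) t0
    ]
    if prev_chunk is not None:
        prev_words = [w for w, _ in prev_chunk]
        feats += [
            "T14:%s|%s" % (prev_type, chunk_type),                       # t-1 t0
            "T17:%s|%s|%s|%s" % (prev_words[-1], prev_type,
                                 words[0], chunk_type),                  # end_word(c-1) t-1 start_word(c0) t0
        ]
    return feats

prev = [("经贸", "NN"), ("关系", "NN")]
cur = [("正", "AD"), ("稳步", "AD"), ("发展", "VV")]
for f in chunk_features(prev, "NP", cur, "VP"):
    print(f)
```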

6 Experiments

6.1 Data Sets and Evaluation

Following previous studies on Chinese chunking (Chen et al., 2006), our experiments were performed on the CTB4 dataset, which consists of 838 files. In the experiments, we used the first 728 files (FID from chtb 001.fid to chtb 899.fid) as training data, and the other 110 files (FID from chtb 900.fid to chtb 1078.fid) as test data. The training set consists of 9878 sentences, and the test set consists of 5920 sentences. The standard evaluation metrics for this task are precision p (the fraction of output chunks matching the reference chunks), recall r (the fraction of reference chunks returned), and the F-measure given by F = 2pr/(p + r).

Our model has two tunable parameters: the number of training iterations N and the number k of top k-best outputs. Since we were interested in finding an effective chunk-level feature representation for phrase chunking, we fixed N = 10 and k = 5 for all experiments. In the following experiments, our model has roughly comparable training time to the sequence labeling approach based on CRFs.
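The chunk-level precision, recall and F-measure described above can be computed as in the following sketch (our own helper; chunks are (start, end, type) triples as in the earlier loss sketch):

```python
def evaluate(gold_sents, pred_sents):
    """Each argument: list of sentences, each a set of (start, end, type) chunks.
    Returns (precision, recall, F-measure) over all sentences."""
    correct = sum(len(g & p) for g, p in zip(gold_sents, pred_sents))
    n_pred = sum(len(p) for p in pred_sents)
    n_gold = sum(len(g) for g in gold_sents)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```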


6.2 Chinese NP Chunking

NP is the most important phrase type in Chinese chunking: about 47% of the phrases in the CTB4 corpus are NPs. In this section, we present the results of our approach on NP recognition.

Table 2 shows the results of two systems using the same feature representation as defined in Table 1, but with different loss functions for learning. As shown, learning with F1 loss improves the F-score by 0.34% over learning with 0-1 loss. It is reasonable that the model optimized with respect to the evaluation metric directly can achieve higher performance.

Loss Function   Precision   Recall   F1
0-1 loss        91.39       90.93    91.16
F1 loss         92.03       90.98    91.50

Table 2: Experimental results on Chinese NP chunking.

6.3 Chinese Text Chunking

There are 12 different types of phrases in the chunking corpus. Table 3 shows the results of two systems with different loss functions for learning. Observing the results in Table 3, we can see that learning with F1 loss improves the F-score by 0.36% over learning with 0-1 loss, similar to the case of NP recognition. More specifically, learning with F1 loss provides much better results for ADJP, ADVP, DVP, NP and VP, and it yields equivalent or comparable results to 0-1 loss in the other categories.

               F1 loss                     0-1 loss
       precision  recall    F1     precision  recall    F1
ADJP     87.86     87.09   87.47     86.74     86.55   86.64
ADVP     90.66     78.73   84.27     91.91     76.68   83.61
CLP       0.00      0.00    0.00      1.32      5.88    2.15
DNP      99.42     99.93   99.68     99.42     99.95   99.69
DP       99.46     99.76   99.61     99.46     99.76   99.61
DVP      99.61     99.61   99.61     99.22     99.61   99.42
LCP      99.74     99.96   99.85     99.74     99.93   99.84
LST      87.50     52.50   65.63     87.50     52.50   65.63
NP       91.87     91.01   91.44     91.34     90.52   90.93
PP       99.57     99.77   99.67     99.57     99.77   99.67
QP       96.45     96.64   96.55     96.45     97.07   96.76
VP       90.14     90.39   90.26     89.92     89.79   89.85
ALL      92.54     91.68   92.11     92.30     91.20   91.75

Table 3: Experimental results on Chinese text chunking.

6.4 Comparison with Other Models

Chen et al. (2006) compared the performance of the state-of-the-art machine learning models for Chinese chunking, and found that the SVM approach yields higher accuracy than the CRF, Transformation-based Learning (TBL) (Megyesi, 2002), and Memory-based Learning (MBL) (Sang, 2002) approaches.

In this section, we give a comparison and analysis between our model and other state-of-the-art machine learning models on the Chinese NP chunking and text chunking tasks. The performance of our model and some of the best results from the state-of-the-art systems are summarized in Table 4. Row "Voting" refers to the phrase-based voting methods based on four basic systems, namely SVMs, CRFs, TBL and MBL, as described in (Chen et al., 2006). Observing the results in Table 4, we can see that for both the NP chunking and text chunking tasks, our model achieves significant performance improvement over those state-of-the-art systems in terms of the F1-score, even over the voting methods. For the text chunking task, our approach improves performance by 0.65% over SVMs and 0.43% over the voting method, respectively.

                Method   F1
NP chunking     CRFs     89.72
                SVMs     90.62
                Voting   91.13
                Ours     91.50
Text chunking   CRFs     90.74
                SVMs     91.46
                Voting   91.68
                Ours     92.11

Table 4: Comparison of chunking performance for Chinese NP chunking and text chunking.

In particular, for the NP chunking task, the F1-score of our approach is improved by 0.88% in comparison with SVMs, the best single system. Further, we investigated the likely cause of the performance improvement by comparing the recognized results from our system and from SVMs.


We first sorted NPs by their length, and then calculated the F1-scores associated with the different lengths for the two systems. Figure 3 compares the F1-scores of the two systems by chunk length. In the Chinese chunking corpus, the maximum NP length is 27, and the mean NP length is 1.5. Among all NPs, the NPs of length 1 account for 81.22%. For the NPs of length 1, our system gives a slight improvement of 0.28% over SVMs. From the figure, we can see that the performance gap grows rapidly with the increase of the chunk length. In particular, the gap between the two systems is 27.73% when the length hits 4. But the gap begins to become smaller with further growth of the chunk length. The reasons may include the following two aspects. First, the number of NPs with greater length is relatively small in the corpus. Second, the NPs with greater length in the Chinese corpus often exhibit some typical patterns. For example, an NP of length 8 is given as follows:

"棉花/NN(cotton) 、/PU 油料/NN(oil) 、/PU 药材/NN(drug) 、/PU 蔬菜/NN(vegetable) 等/ETC(et al)"

This NP consists of a sequence of nouns simply separated by the punctuation "、", so it is also easy to recognize for the sequence labeling approach based on SVMs. In summary, the above investigation indicates that our system is better at recognizing long and complicated phrases than the sequence labeling approaches.

[Figure 3 (plot omitted): F-score versus NP length (1-8 and >8) for our system and SVMs.]
Figure 3: Comparison of F1-scores of NP recognition on the Chinese corpus by chunk length.

6.5 Impact of Different Types of Features

Our phrase chunking model is highly dependent upon chunk-level information. To establish the impact of each type of feature (SL-type, Internal-type, Correlation-type), we look at the improvement in F1-score brought about by adding each type of features. Table 5 shows the accuracy with the various features added to the model.
First consider the effect of the SL-type features. If we use only the SL-type features, the system achieves slightly lower performance than CRFs or SVMs, as shown in Table 4; this is because the SL-type features only include the features associated with a single label, not the features associated with label bigrams. Adding the Internal-type features to the system then results in significant performance improvement on NP chunking and on text chunking, by 2.53% and 1.37%, respectively. Further, when the Correlation-type features are added, the F1-scores on NP chunking and on text chunking are improved by a further 1.01% and 0.66%, respectively. The results show a significant impact from the use of the Internal-type and Correlation-type features for both NP chunking and text chunking.

Task            Feature Type        F1
NP chunking     SL-type             87.96
                +Internal-type      90.49
                +Correlation-type   91.50
Text chunking   SL-type             90.08
                +Internal-type      91.45
                +Correlation-type   92.11

Table 5: Test F1-scores for different types of features on the Chinese corpus.

6.6 Performance on Other Languages

We mainly focused on Chinese chunking in this paper. However, our approach is generally applicable to other languages, including English, except that the definition of the feature templates may be language-specific. To validate this point, we evaluated our system on the CoNLL 2000 data set, a public benchmarking corpus for English chunking (Sang and Buchholz, 2000). The training set consists of 8936 sentences, and the test set consists of 2012 sentences.

We conducted both the NP-chunking and text chunking experiments on this data set with our approach, using the same feature templates as for the Chinese chunking task, excluding template 13. To find the headword within every chunk, we referred to the head-finding rules from (Collins, 1999), and made a simple modification to them in a similar way as for Chinese. As we can see from Table 6, our model is able to achieve better performance than state-of-the-art systems.


Table 6 also shows the state-of-the-art performance for both the NP-chunking and text chunking tasks: the LDCRF results presented in (Sun et al., 2008) are the state of the art for the NP chunking task, and the SVM results presented in (Wu et al., 2006) are the state of the art for the text chunking task. Moreover, the performance should be further improved if some additional features tailored to English chunking are employed in our model. For example, we could introduce an orthographic feature type called the Token feature and the affix feature into the model, as used in (Wu et al., 2006).

                Method   Precision   Recall   F1
NP chunking     Ours     94.79       94.65    94.72
                LDCRF    94.65       94.03    94.34
Text chunking   Ours     94.31       94.12    94.22
                SVMs     94.12       94.13    94.12

Table 6: Performance on the English corpus.

7 Conclusions and Future Work

In this paper we have presented a novel approach to phrase chunking by formulating it as a joint segmentation and labeling problem. One important advantage of our approach is that it provides a natural formulation in which to exploit chunk-level features. The experimental results on both Chinese chunking and English chunking tasks show that the use of chunk-level features can lead to significant performance improvement, and that our approach outperforms the best results in the literature.

Future work mainly includes the following two aspects. Firstly, we will explore applying external information, such as semantic knowledge, to represent the chunk-level features, and then incorporate it into our model to improve performance. Secondly, we plan to apply our approach to other joint segmentation and labeling tasks, such as clause identification and named entity recognition.

Acknowledgments

This research is supported by Projects 61073119 and 60773173 under the National Natural Science Foundation of China, and by Project BK2010547 under the Jiangsu Natural Science Foundation of China. We would also like to thank the three anonymous reviewers for their excellent and insightful comments.
References

Steven P. Abney. 1991. Parsing by chunks. In Robert C. Berwick, Steven P. Abney, and Carol Tenny, editors, Principle-Based Parsing, pages 257-278. Kluwer Academic Publishers.

Daniel M. Bikel. 2004. On the Parameter Space of Generative Lexicalized Statistical Parsing Models. Ph.D. thesis, University of Pennsylvania.

Wenliang Chen, Yujie Zhang, and Hitoshi Isahara. 2006. An empirical study of Chinese chunking. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 97-104.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP-02.

Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

Koby Crammer. 2004. Online Learning of Complex Categorial Problems. Ph.D. thesis, Hebrew University of Jerusalem.

Taku Kudo and Yuji Matsumoto. 2001. Chunking with support vector machines. In Proceedings of NAACL-01.

Koby Crammer, Ryan McDonald, and Fernando Pereira. 2005. Scalable large-margin online learning for structured classification. In NIPS Workshop on Learning With Structured Outputs.

Heng Li, Jonathan J. Webster, Chunyu Kit, and Tianshun Yao. 2003. Transductive HMM based Chinese text chunking. In Proceedings of IEEE NLPKE-2003, pages 257-262, Beijing, China.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of HLT/EMNLP, pages 523-530.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Flexible text segmentation with structured multilabel classification. In Proceedings of HLT/EMNLP, pages 987-994.

Ryan McDonald. 2006. Discriminative Training and Spanning Tree Algorithms for Dependency Parsing. Ph.D. thesis, University of Pennsylvania.

Beata Megyesi. 2002. Shallow parsing with pos taggers and linguistic features. Journal of Machine Learning Research, 2:639-668.


Antonio Molina and Ferran Pla. 2002. Shallow parsing using specialized HMMs. Journal of Machine Learning Research, 2:595-613.

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of CoNLL-2000, pages 127-132.

Sunita Sarawagi and William W. Cohen. 2004. Semi-Markov conditional random fields for information extraction. In Proceedings of NIPS 17, pages 1185-1192.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL 2003.

Xu Sun, Louis-Philippe Morency, Daisuke Okanohara, and Jun'ichi Tsujii. 2008. Modeling latent-dynamic in shallow parsing: A latent conditional model with improved inference. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 841-848.

Yongmei Tan, Tianshun Yao, Qing Chen, and Jingbo Zhu. 2004. Chinese chunk identification using SVMs plus sigmoid. In IJCNLP, pages 527-536.

Yongmei Tan, Tianshun Yao, Qing Chen, and Jingbo Zhu. 2005. Applying conditional random fields to Chinese shallow parsing. In Proceedings of CICLing-2005, pages 167-176.

Erik F. Tjong Kim Sang. 2002. Memory-based shallow parsing. Journal of Machine Learning Research, 2(3):559-594.

Yu-Chieh Wu, Chia-Hui Chang, and Yue-Shi Lee. 2006. A general and multi-lingual phrase chunking model based on masking method. In Proceedings of the 7th International Conference on Intelligent Text Processing and Computational Linguistics, pages 144-155.

Nianwen Xue, Fei Xia, Shizhe Huang, and Anthony Kroch. 2000. The bracketing guidelines for the Penn Chinese Treebank. Technical report, University of Pennsylvania.

Yair Censor and Stavros A. Zenios. 1997. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press.

Tong Zhang, Fred Damerau, and David Johnson. 2002. Text chunking based on a generalization of Winnow. Journal of Machine Learning Research, 2:615-637.

Yue Zhang and Stephen Clark. 2008. Joint word segmentation and POS tagging using a single perceptron. In Proceedings of ACL/HLT, pages 888-896.

Yue Zhang and Stephen Clark. 2010. A fast decoder for joint word segmentation and POS-tagging using a single discriminative model. In Proceedings of EMNLP, pages 843-852.
