Proceedings of HLT-NAACL 2003, Main Papers, pp. 134-141, Edmonton, May-June 2003

Shallow Parsing with Conditional Random Fields

Fei Sha and Fernando Pereira
Department of Computer and Information Science, University of Pennsylvania
200 South 33rd Street, Philadelphia, PA 19104
(feisha|pereira)@cis.upenn.edu

Abstract

Conditional random fields for sequence labeling offer advantages over both generative models like HMMs and classifiers applied at each sequence position. Among sequence labeling tasks in language processing, shallow parsing has received much attention, with the development of standard evaluation datasets and extensive comparison among methods. We show here how to train a conditional random field to achieve performance as good as any reported base noun-phrase chunking method on the CoNLL task, and better than any reported single model. Improved training methods based on modern optimization algorithms were critical in achieving these results. We present extensive comparisons between models and training methods that confirm and strengthen previous results on shallow parsing and training methods for maximum-entropy models.

1 Introduction

Sequence analysis tasks in language and biology are often described as mappings from input sequences to sequences of labels encoding the analysis. In language processing, examples of such tasks include part-of-speech tagging, named-entity recognition, and the task we shall focus on here, shallow parsing. Shallow parsing identifies the non-recursive cores of various phrase types in text, possibly as a precursor to full parsing or information extraction (Abney, 1991). The paradigmatic shallow-parsing problem is NP chunking, which finds the non-recursive cores of noun phrases called base NPs. The pioneering work of Ramshaw and Marcus (1995) introduced NP chunking as a machine-learning problem, with standard datasets and evaluation metrics. The task was extended to additional phrase types for the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000), which is now the standard evaluation task for shallow parsing.

Most previous work used two main machine-learning approaches to sequence labeling. The first approach relies on k-order generative probabilistic models of paired input sequences and label sequences, for instance hidden Markov models (HMMs) (Freitag and McCallum, 2000; Kupiec, 1992) or multilevel Markov models (Bikel et al., 1999). The second approach views the sequence labeling problem as a sequence of classification problems, one for each of the labels in the sequence. The classification result at each position may depend on the whole input and on the previous k classifications. [1]

The generative approach provides well-understood training and decoding algorithms for HMMs and more general graphical models. However, effective generative models require stringent conditional independence assumptions. For instance, it is not practical to make the label at a given position depend on a window on the input sequence as well as the surrounding labels, since the inference problem for the corresponding graphical model would be intractable. Non-independent features of the inputs, such as capitalization, suffixes, and surrounding words, are important in dealing with words unseen in training, but they are difficult to represent in generative models.

The sequential classification approach can handle many correlated features, as demonstrated in work on maximum-entropy models (McCallum et al., 2000; Ratnaparkhi, 1996) and a variety of other linear classifiers, including winnow (Punyakanok and Roth, 2001), AdaBoost (Abney et al., 1999), and support-vector machines (Kudo and Matsumoto, 2001). Furthermore, they are trained to minimize some function related to labeling error, leading to smaller error in practice if enough training data are available. In contrast, generative models are trained to maximize the joint probability of the training data, which is not as closely tied to the accuracy metrics of interest if the actual data was not generated by the model, as is always the case in practice.

However, since sequential classifiers are trained to make the best local decision, unlike generative models they cannot trade off decisions at different positions against each other. In other words, sequential classifiers are myopic about the impact of their current decision on later decisions (Bottou, 1991; Lafferty et al., 2001). This forced the best sequential classifier systems to resort to heuristic combinations of forward-moving and backward-moving sequential classifiers (Kudo and Matsumoto, 2001).

Conditional random fields (CRFs) bring together the best of generative and classification models. Like classification models, they can accommodate many statistically correlated features of the inputs, and they are trained discriminatively. But like generative models, they can trade off decisions at different sequence positions to obtain a globally optimal labeling. Lafferty et al. (2001) showed that CRFs beat related classification models as well as HMMs on synthetic data and on a part-of-speech tagging task.

In the present work, we show that CRFs beat all reported single-model NP chunking results on the standard evaluation dataset, and are statistically indistinguishable from the previous best performer, a voting arrangement of 24 forward- and backward-looking support-vector classifiers (Kudo and Matsumoto, 2001). To obtain these results, we had to abandon the original iterative scaling CRF training algorithm for convex optimization algorithms with better convergence properties. We provide detailed comparisons between training methods.

The generalized perceptron proposed by Collins (2002) is closely related to CRFs, but the best CRF training methods seem to have a slight edge over the generalized perceptron.

[1] Ramshaw and Marcus (1995) used transformation-based learning (Brill, 1995), which for the present purposes can be thought of as a classification-based method.
2 Conditional Random Fields

We focus here on conditional random fields on sequences, although the notion can be used more generally (Lafferty et al., 2001; Taskar et al., 2002). Such CRFs define conditional probability distributions p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length, and use x = x_1 ... x_n and y = y_1 ... y_n for the generic input sequence and label sequence, respectively.

A CRF on (X, Y) is specified by a vector f of local features and a corresponding weight vector λ. Each local feature is either a state feature s(y, x, i) or a transition feature t(y, y', x, i), where y, y' are labels, x an input sequence, and i an input position. To make the notation more uniform, we also write

    s(y, y', x, i) = s(y', x, i)
    s(y, x, i) = s(y_i, x, i)
    t(y, x, i) = t(y_{i-1}, y_i, x, i)  if i > 1,  and 0 if i = 1

for any state feature s and transition feature t. Typically, features depend on the inputs around the given position, although they may also depend on global properties of the input, or be non-zero only at some positions, for instance features that pick out the first or last labels.

The CRF's global feature vector for input sequence x and label sequence y is given by

    F(y, x) = Σ_i f(y, x, i)

where i ranges over input positions. The conditional probability distribution defined by the CRF is then

    p_λ(Y | X) = exp(λ · F(Y, X)) / Z_λ(X)        (1)

where

    Z_λ(x) = Σ_y exp(λ · F(y, x))

Any positive conditional distribution p(Y|X) that obeys the Markov property

    p(Y_i | {Y_j}_{j≠i}, X) = p(Y_i | Y_{i-1}, Y_{i+1}, X)

can be written in the form (1) for appropriate choice of feature functions and weight vector (Hammersley and Clifford, 1971).

The most probable label sequence for input sequence x is

    ŷ = arg max_y p_λ(y | x) = arg max_y λ · F(y, x)

because Z_λ(x) does not depend on y. F(y, x) decomposes into a sum of terms for consecutive pairs of labels, so the most likely y can be found with the Viterbi algorithm.
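Since the score λ · F(y, x) is a sum of per-position terms, decoding reduces to the standard Viterbi recursion over one K x K score matrix per position. The sketch below is an illustration under assumed conventions (scores are precomputed, and row 0 of the first matrix acts as a start state); it is not the authors' Java implementation.

```python
# Minimal sketch of Viterbi decoding for a linear-chain CRF.  logM[i][a, b]
# holds the score lambda . f(a, b, x, i) of moving from label a to label b at
# position i; row 0 of logM[0] supplies the scores for the first position.
import numpy as np

def viterbi(logM):
    n = len(logM)
    K = logM[0].shape[1]
    delta = logM[0][0, :].copy()          # best score ending in each label
    back = np.zeros((n, K), dtype=int)    # back-pointers (back[0] unused)
    for i in range(1, n):
        cand = delta[:, None] + logM[i]   # cand[a, b] = best-to-a + score(a -> b)
        back[i] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    # Follow back-pointers from the best final label
    y = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return list(reversed(y))
```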
We train a CRF by maximizing the log-likelihood of a given training set T = {(x_k, y_k)}_{k=1..N}, which we assume fixed for the rest of this section:

    L_λ = Σ_k log p_λ(y_k | x_k)
        = Σ_k [λ · F(y_k, x_k) - log Z_λ(x_k)]

To perform this optimization, we seek the zero of the gradient

    ∇L_λ = Σ_k [F(y_k, x_k) - E_{p_λ(Y|x_k)} F(Y, x_k)]        (2)

In words, the maximum of the training data likelihood is reached when the empirical average of the global feature vector equals its model expectation.
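For very short sequences, L_λ and the gradient in Eq. (2) can be checked by brute-force enumeration, which makes this moment-matching condition concrete; the efficient forward-backward computation is given next. The helper names and the feature interface below are illustrative assumptions, not part of the paper.

```python
# Brute-force L_lambda and Eq. (2) on a tiny example, enumerating all label
# sequences.  Only feasible for short sequences and small label sets; the
# forward-backward recursions below are the practical alternative.
import itertools
import numpy as np

def global_F(y, x, f, D):
    """Sum the local feature vector f(y_{i-1}, y_i, x, i) over positions.
    f returns a length-D numpy array; y_{i-1} is passed as None at i = 0."""
    F = np.zeros(D)
    for i in range(len(x)):
        F += f(y[i - 1] if i > 0 else None, y[i], x, i)
    return F

def loglik_and_grad(lam, x, y_obs, f, D, labels):
    seqs = list(itertools.product(labels, repeat=len(x)))
    Fs = np.array([global_F(y, x, f, D) for y in seqs])
    scores = Fs @ lam
    logZ = np.logaddexp.reduce(scores)
    p = np.exp(scores - logZ)                      # p_lambda(y | x) for every y
    F_obs = global_F(y_obs, x, f, D)
    expected_F = p @ Fs                            # E_{p_lambda(Y|x)} F(Y, x)
    return F_obs @ lam - logZ, F_obs - expected_F  # L_lambda term, Eq. (2) term
```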

The expectation E_{p_λ(Y|x)} F(Y, x) can be computed efficiently using a variant of the forward-backward algorithm. For a given x, define the transition matrix for position i as

    M_i[y, y'] = exp(λ · f(y, y', x, i))

Let f be any local feature, f_i[y, y'] = f(y, y', x, i), F(y, x) = Σ_i f(y_{i-1}, y_i, x, i), and let ∗ denote component-wise matrix product. Then

    E_{p_λ(Y|x)} F(Y, x) = Σ_y p_λ(y | x) F(y, x)
                         = Σ_i α_{i-1} (f_i ∗ M_i) β_i^T / Z_λ(x)

    Z_λ(x) = α_n · 1^T

where α_i and β_i are the forward and backward state-cost vectors defined by

    α_i = α_{i-1} M_i   if 0 < i ≤ n,   and 1 if i = 0
    β_i^T = M_{i+1} β_{i+1}^T   if 1 ≤ i < n,   and 1 if i = n

Therefore, we can use a forward pass to compute the α_i and a backward pass to compute the β_i and accumulate the feature expectations.
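A direct transcription of these recursions, in unnormalized form (real implementations rescale or work in log space to avoid overflow); the arrays M, feat and the start-state convention are assumptions carried over from the earlier sketches, not the authors' code.

```python
# Sketch of the alpha/beta recursions and the feature-expectation sum above.
# M[i] is the (K, K) matrix M_{i+1}; feat[i][a, b] is a (D,)-vector of local
# feature values f(a, b, x, i+1).  Index 0 of the label set acts as the start
# state, mirroring the Viterbi sketch.
import numpy as np

def expected_features(M, feat, D):
    n, K = len(M), M[0].shape[1]
    alpha = np.zeros((n + 1, K))
    beta = np.ones((n + 1, K))
    alpha[0, 0] = 1.0                        # start state
    for i in range(1, n + 1):
        alpha[i] = alpha[i - 1] @ M[i - 1]   # alpha_i = alpha_{i-1} M_i
    for i in range(n - 1, 0, -1):
        beta[i] = M[i] @ beta[i + 1]         # beta_i^T = M_{i+1} beta_{i+1}^T
    Z = alpha[n].sum()                       # Z_lambda(x) = alpha_n . 1^T
    E = np.zeros(D)
    for i in range(1, n + 1):
        # P(Y_{i-1}=a, Y_i=b | x) = alpha_{i-1}[a] M_i[a, b] beta_i[b] / Z
        P = (alpha[i - 1][:, None] * M[i - 1]) * beta[i][None, :] / Z
        E += np.einsum('ab,abd->d', P, feat[i - 1])
    return E, Z
```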

To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior (Chen and Rosenfeld, 1999):

    L'_λ = Σ_k [λ · F(y_k, x_k) - log Z_λ(x_k)] - ||λ||^2 / (2σ^2) + const

with gradient

    ∇L'_λ = Σ_k [F(y_k, x_k) - E_{p_λ(Y|x_k)} F(Y, x_k)] - λ / σ^2

3 Training Methods

Lafferty et al. (2001) used iterative scaling algorithms for CRF training, following earlier work on maximum-entropy models for natural language (Berger et al., 1996; Della Pietra et al., 1997). Those methods are very simple and guaranteed to converge, but as Minka (2001) and Malouf (2002) showed for classification, their convergence is much slower than that of general-purpose convex optimization algorithms when many correlated features are involved. Concurrently with the present work, Wallach (2002) tested conjugate gradient and second-order methods for CRF training, showing significant training speed advantages over iterative scaling on a small shallow parsing problem. Our work shows that preconditioned conjugate-gradient (CG) (Shewchuk, 1994) or limited-memory quasi-Newton (L-BFGS) (Nocedal and Wright, 1999) perform comparably on very large problems (around 3.8 million features). We compare those algorithms to generalized iterative scaling (GIS) (Darroch and Ratcliff, 1972), non-preconditioned CG, and voted perceptron training (Collins, 2002). All algorithms except voted perceptron maximize the penalized log-likelihood: λ* = arg max_λ L'_λ. However, for ease of exposition, this discussion of training methods uses the unpenalized log-likelihood L_λ.

3.1 Preconditioned Conjugate Gradient

Conjugate-gradient (CG) methods have been shown to be very effective in linear and non-linear optimization (Shewchuk, 1994). Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction.

CG methods can be accelerated by linearly transforming the variables with a preconditioner (Nocedal and Wright, 1999; Shewchuk, 1994). The purpose of the preconditioner is to improve the condition number of the quadratic form that locally approximates the objective function, so the inverse of the Hessian is a reasonable preconditioner. However, this is not applicable to CRFs for two reasons. First, the size of the Hessian is dim(λ)^2, leading to unacceptable space and time requirements for the inversion. In such situations, it is common to use instead the (inverse of the) diagonal of the Hessian. However, in our case the Hessian has the form

    H_λ := ∇^2 L_λ = - Σ_k { E[F(Y, x_k) ⊗ F(Y, x_k)] - E F(Y, x_k) ⊗ E F(Y, x_k) }

where the expectations are taken with respect to p_λ(Y|x_k). Therefore, every Hessian element, including the diagonal ones, involves the expectation of a product of global feature values. Unfortunately, computing those expectations is quadratic on sequence length, as the forward-backward algorithm can only compute expectations of quantities that are additive along label sequences.

We solve both problems by discarding the off-diagonal terms and approximating the expectation of the square of a global feature by the expectation of the sum of squares of the corresponding local features at each position. The approximated diagonal term H_f for feature f has the form

    H_f = E_{p_λ(Y|x_k)} [ Σ_i f(Y_{i-1}, Y_i, x_k, i)^2 ] - ( E_{p_λ(Y|x_k)} f(Y, x_k) )^2

where the first expectation is accumulated position by position from the pairwise marginals α_{i-1}[y] M_i[y, y'] β_i[y'] / Z_λ(x), as in the previous section. If this approximation is semidefinite, which is trivial to check, its inverse is an excellent preconditioner for early iterations of CG training. However, when the model is close to the maximum, the approximation becomes unstable, which is not surprising since it is based on feature independence assumptions that become invalid as the weights of interaction features move away from zero. Therefore, we disable the preconditioner after a certain number of iterations, determined from held-out data. We call this strategy mixed CG training.

3.2 Limited-Memory Quasi-Newton

Newton methods for nonlinear optimization use second-order (curvature) information to find search directions. As discussed in the previous section, it is not practical to obtain exact curvature information for CRF training. Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact Hessian inverse computation. Compared with preconditioned CG, L-BFGS can also handle large-scale problems but does not require specialized Hessian approximations. An earlier study indicates that L-BFGS performs well in maximum-entropy classifier training (Malouf, 2002).

There is no theoretical guidance on how much information from previous steps we should keep to obtain sufficiently accurate curvature estimates. In our experiments, storing 3 to 10 pairs of previous gradients and updates worked well, so the extra memory required over preconditioned CG was modest. A more detailed description of this method can be found elsewhere (Nocedal and Wright, 1999).
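As a concrete illustration of how the penalized objective L'_λ and its gradient plug into a limited-memory quasi-Newton routine, here is a minimal sketch using SciPy's L-BFGS-B implementation as a stand-in for the trainer described in the text (which was a custom Java implementation). The helper loglik_and_grad_all, the dimension argument, and the defaults are illustrative assumptions.

```python
# Sketch: maximize the penalized log-likelihood L'_lambda with an off-the-shelf
# limited-memory quasi-Newton routine.  `loglik_and_grad_all` is a hypothetical
# helper that sums the per-sentence log-likelihood and gradient computed with
# the forward-backward recursions above.
import numpy as np
from scipy.optimize import minimize

def train_lbfgs(loglik_and_grad_all, dim, sigma=1.0, m=5, max_iter=200):
    def objective(lam):
        L, g = loglik_and_grad_all(lam)            # unpenalized L_lambda, grad
        L -= lam @ lam / (2.0 * sigma ** 2)        # Gaussian prior penalty
        g = g - lam / sigma ** 2
        return -L, -g                              # minimize the negative

    result = minimize(objective, np.zeros(dim), jac=True,
                      method='L-BFGS-B',
                      options={'maxcor': m, 'maxiter': max_iter})
    return result.x
```

The 'maxcor' option corresponds to the number of stored gradient/update pairs, mirroring the 3 to 10 pairs mentioned above.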
3.3 Voted Perceptron

Unlike other methods discussed so far, voted perceptron training (Collins, 2002) attempts to minimize the difference between the global feature vector for a training instance and the same feature vector for the best-scoring labeling of that instance according to the current model. More precisely, for each training instance the method computes a weight update

    λ_{t+1} = λ_t + F(y_k, x_k) - F(ŷ_k, x_k)        (3)

in which ŷ_k is the Viterbi path

    ŷ_k = arg max_y λ_t · F(y, x_k)

Like the familiar perceptron algorithm, this algorithm repeatedly sweeps over the training instances, updating the weight vector as it considers each instance. Instead of taking just the final weight vector, the voted perceptron algorithm takes the average of the λ_t. Collins (2002) reported, and we confirmed, that this averaging reduces overfitting considerably.
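The update in Eq. (3), together with the averaging step, can be sketched as follows. This is an illustrative reconstruction, not Collins' or the authors' implementation; decode and global_F stand for the Viterbi decoder and global feature vector from the earlier sketches.

```python
# Sketch of voted (averaged) perceptron training, Eq. (3).  `decode(lam, x)`
# stands for Viterbi decoding under weights lam, and `global_F(y, x)` for the
# global feature vector; `train` yields (x, y) pairs with y a list of labels.
import numpy as np

def averaged_perceptron(train, decode, global_F, dim, sweeps=10):
    lam = np.zeros(dim)
    lam_sum = np.zeros(dim)   # running sum of the lambda_t, for the average
    updates = 0
    for _ in range(sweeps):
        for x, y in train:
            y_hat = decode(lam, x)                       # best path under lam
            if y_hat != y:
                lam = lam + global_F(y, x) - global_F(y_hat, x)
            lam_sum += lam
            updates += 1
    return lam_sum / updates  # averaged weights, which reduce overfitting
```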
4 Shallow Parsing

Figure 1 shows the base NPs in an example sentence. Following Ramshaw and Marcus (1995), the input to the NP chunker consists of the words in a sentence annotated automatically with part-of-speech (POS) tags. The chunker's task is to label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I). For example, the tokens in the first line of Figure 1 would be labeled BIIBIIOBOBIIO.

4.1 Data Preparation

NP chunking results have been reported on two slightly different data sets: the original RM data set of Ramshaw and Marcus (1995), and the modified CoNLL-2000 version of Tjong Kim Sang and Buchholz (2000). Although the chunk tags in the RM and CoNLL-2000 data are somewhat different, we found no significant accuracy differences between models trained on these two data sets. Therefore, all our results are reported on the CoNLL-2000 data set. We also used a development test set, provided by Michael Collins, derived from WSJ section 21 tagged with the Brill (1995) POS tagger.

4.2 CRFs for Shallow Parsing

Our chunking CRFs have a second-order Markov dependency between chunk tags. This is easily encoded by making the CRF labels pairs of consecutive chunk tags. That is, the label at position i is y_i = c_{i-1}c_i, where c_i is the chunk tag of word i, one of O, B, or I. Since B must be used to start a chunk, the label OI is impossible. In addition, successive labels are constrained: y_{i-1} = c_{i-2}c_{i-1}, y_i = c_{i-1}c_i, and c_0 = O. These constraints on the model topology are enforced by giving appropriate features a weight of -∞, forcing all the forbidden labelings to have zero probability.
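A small sketch of this label-pair encoding and its topology constraints, under the conventions just described; all names are illustrative.

```python
# Sketch of the second-order label encoding: each CRF label is a pair of
# consecutive chunk tags c_{i-1} c_i, with OI forbidden and consecutive labels
# required to agree on the shared chunk tag.
from itertools import product

CHUNK_TAGS = ['O', 'B', 'I']
LABELS = [a + b for a, b in product(CHUNK_TAGS, repeat=2) if a + b != 'OI']

def allowed(prev_label, label):
    """A transition prev_label -> label is consistent iff the second tag of
    prev_label equals the first tag of label."""
    return prev_label[1] == label[0]

def encode(chunk_tags):
    """Map chunk tags c_1..c_n to labels y_i = c_{i-1}c_i, with c_0 = O."""
    return [('O' if i == 0 else chunk_tags[i - 1]) + chunk_tags[i]
            for i in range(len(chunk_tags))]

# Example: the first line of Figure 1 has chunk tags B I I B I I O B O B I I O,
# which encode to OB, BI, II, IB, BI, II, IO, OB, BO, OB, BI, II, IO.
print(encode(list('BIIBIIOBOBIIO')))
```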

Figure 1: NP chunks. Example sentence: Rockwell International Corp. 's Tulsa unit said it signed a tentative agreement extending its contract with Boeing Co. to provide structural parts for Boeing 's 747 jetliners .

Our choice of features was mainly governed by computing power, since we do not use feature selection and all features are used in training and testing. We use the following factored representation for features:

    f(y_{i-1}, y_i, x, i) = p(x, i) q(y_{i-1}, y_i)        (4)

where p(x, i) is a predicate on the input sequence x and current position i and q(y_{i-1}, y_i) is a predicate on pairs of labels. For instance, p(x, i) might be "word at position i is the" or "the POS tags at positions i - 1, i are DT, NN." Because the label set is finite, such a factoring of f(y_{i-1}, y_i, x, i) is always possible, and it allows each input predicate to be evaluated just once for many features that use it, making it possible to work with millions of features on large training sets.

Table 1 summarizes the feature set. For a given position i, w_i is the word, t_i its POS tag, and y_i its label. For any label y = c'c, c(y) = c is the corresponding chunk tag. For example, c(OB) = B. The use of chunk tags as well as labels provides a form of backoff from the very small feature counts that may arise in a second-order model, while allowing significant associations between tag pairs and input predicates to be modeled. To save time in some of our experiments, we used only the 820,000 features that are supported in the CoNLL training set, that is, the features that are on at least once. For our highest F score, we used the complete feature set, around 3.8 million in the CoNLL training set, which contains all the features whose predicate is on at least once in the training set. The complete feature set may in principle perform better because it can place negative weights on transitions that should be discouraged if a given predicate is on.

Table 1: Shallow parsing features

    q(y_{i-1}, y_i)                  p(x, i)
    y_i = y                          true
    y_i = y, y_{i-1} = y'
    c(y_i) = c
    -------------------------------------------------------
    y_i = y                          w_i = w or w_{i-1} = w
    c(y_i) = c                       w_{i+1} = w
                                     w_{i-2} = w
                                     w_{i+2} = w
                                     w_{i-1} = w', w_i = w
                                     w_{i+1} = w', w_i = w
                                     t_i = t
                                     t_{i-1} = t
                                     t_{i+1} = t
                                     t_{i-2} = t
                                     t_{i+2} = t
                                     t_{i-1} = t', t_i = t
                                     t_{i-2} = t', t_{i-1} = t
                                     t_i = t', t_{i+1} = t
                                     t_{i+1} = t', t_{i+2} = t
                                     t_{i-2} = t'', t_{i-1} = t', t_i = t
                                     t_{i-1} = t'', t_i = t', t_{i+1} = t
                                     t_i = t'', t_{i+1} = t', t_{i+2} = t

4.3 Parameter Tuning

As discussed previously, we need a Gaussian weight prior to reduce overfitting. We also need to choose the number of training iterations, since we found that the best F score is attained while the log-likelihood is still improving. The reasons for this are not clear, but the Gaussian prior may not be enough to keep the optimization from making weight adjustments that slightly improve training log-likelihood but cause large F score fluctuations. We used the development test set mentioned in Section 4.1 to set the prior and the number of iterations.

4.4 Evaluation Metric

The standard evaluation metrics for a chunker are precision P (fraction of output chunks that exactly match the reference chunks), recall R (fraction of reference chunks returned by the chunker), and their harmonic mean, the F1 score F1 = 2PR / (P + R) (which we call just F score in what follows). The relationships between F score and labeling error or log-likelihood are not direct, so we report both F score and the other metrics for the models we tested. For comparisons with other reported results we use F score.
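A minimal sketch of these chunk-level metrics, assuming the B/I/O conventions of Section 4 (every chunk starts with B); the official CoNLL-2000 evaluation script, not this sketch, is the reference implementation.

```python
# Sketch of the chunk-level metrics in Section 4.4: precision, recall, and F1
# computed from BIO chunk-tag sequences.
def chunk_spans(tags):
    """Spans (start, end_exclusive) of chunks in a BIO sequence where every
    chunk starts with B and continues with I."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t != 'I' and start is not None:      # current chunk ends before i
            spans.append((start, i))
            start = None
        if t == 'B':
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return set(spans)

def prf(gold_tags, pred_tags):
    gold, pred = chunk_spans(gold_tags), chunk_spans(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example from Section 5.3: BIIIIIII mislabeled as OIIIIIII has recall 0,
# even though 7 of the 8 individual labels are correct.
print(prf(list('BIIIIIII'), list('OIIIIIII')))   # -> (0.0, 0.0, 0.0)
```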
4.5 Significance Tests

Ideally, comparisons among chunkers would control for feature sets, data preparation, training and test procedures, and parameter tuning, and estimate the statistical significance of performance differences. Unfortunately, reported results sometimes leave out details needed for accurate comparisons. We report F scores for comparison with previous work, but we also give statistical significance estimates using McNemar's test for those methods that we evaluated directly.

Testing the significance of F scores is tricky because the wrong chunks generated by two chunkers are not directly comparable. Yeh (2000) examined randomized tests for estimating the significance of F scores, and in particular the bootstrap over the test set (Efron and Tibshirani, 1993; Sang, 2002). However, bootstrap variances in preliminary experiments were too high to allow any conclusions, so we used instead a McNemar paired test on labeling disagreements (Gillick and Cox, 1989).
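For concreteness, the bootstrap over the test set mentioned above can be sketched as follows; chunk_spans is assumed from the earlier metric sketch, and this is only an illustration of the idea, not the exact procedure used in the preliminary experiments.

```python
# Sketch of the bootstrap over the test set: resample test sentences with
# replacement and examine the distribution of the F-score difference between
# two chunkers.
import random

def corpus_f(gold_seqs, pred_seqs):
    correct = guessed = reference = 0
    for g, p in zip(gold_seqs, pred_seqs):
        gs, ps = chunk_spans(g), chunk_spans(p)
        correct += len(gs & ps)
        guessed += len(ps)
        reference += len(gs)
    prec = correct / guessed if guessed else 0.0
    rec = correct / reference if reference else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def bootstrap_f_diff(gold, pred_a, pred_b, samples=1000, seed=0):
    rng = random.Random(seed)
    n, diffs = len(gold), []
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(corpus_f([gold[i] for i in idx], [pred_a[i] for i in idx])
                     - corpus_f([gold[i] for i in idx], [pred_b[i] for i in idx]))
    return diffs   # inspect the spread or the fraction of sign changes
```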

Table 2: NP chunking F scores

    Model                                        F score
    SVM combination (Kudo and Matsumoto, 2001)   94.39%
    CRF                                          94.38%
    Generalized winnow (Zhang et al., 2002)      93.89%
    Voted perceptron                             94.09%
    MEMM                                         93.70%

Table 3: Runtime for various training methods

    training method   time    F score   L'_λ
    Precond. CG       130     94.19%    -2968
    Mixed CG          540     94.20%    -2990
    Plain CG          648     94.04%    -2967
    L-BFGS            84      94.19%    -2948
    GIS               3700    93.55%    -5668

Table 4: McNemar's tests on labeling disagreements

    null hypothesis               p-value
    CRF vs. SVM                   0.469
    CRF vs. MEMM                  0.00109
    CRF vs. voted perceptron      0.116
    MEMM vs. voted perceptron     0.0734

5 Results

All the experiments were performed with our Java implementation of CRFs, designed to handle millions of features, on 1.7 GHz Pentium IV processors with Linux and IBM Java 1.3.0. Minor variants support voted perceptron (Collins, 2002) and MEMMs (McCallum et al., 2000) with the same efficient feature encoding. GIS, CG, and L-BFGS were used to train CRFs and MEMMs.

5.1 F Scores

Table 2 gives representative NP chunking F scores for previous work and for our best model, with the complete set of 3.8 million features. The last row of the table gives the score for an MEMM trained with the mixed CG method using an approximate preconditioner. The published F score for voted perceptron is 93.53% with a different feature set (Collins, 2002). The improved result given here is for the supported feature set; the complete feature set gives a slightly lower score of 94.07%. Zhang et al. (2002) reported a higher F score (94.38%) with generalized winnow using additional linguistic features that were not available to us.

5.2 Convergence Speed

All the results in the rest of this section are for the smaller supported set of 820,000 features. Figures 2a and 2b show how preconditioning helps training convergence. Since each CG iteration involves a line search that may require several forward-backward procedures (typically between 4 and 5 in our experiments), we plot the progress of the penalized log-likelihood L'_λ with respect to the number of forward-backward evaluations. The objective function increases rapidly, achieving close proximity to the maximum in a few iterations (typically 10). In contrast, GIS training increases L'_λ rather slowly, never reaching the value achieved by CG. The relative slowness of iterative scaling is also documented in a recent evaluation of training methods for maximum-entropy classification (Malouf, 2002). In theory, GIS would eventually converge to the L'_λ optimum, but in practice convergence may be so slow that L'_λ improvements may fall below numerical accuracy, falsely indicating convergence.

Mixed CG training converges slightly more slowly than preconditioned CG. On the other hand, CG without preconditioner converges much more slowly than both preconditioned CG and mixed CG training. However, it is still much faster than GIS. We believe that the superior convergence rate of preconditioned CG is due to the use of approximate second-order information. This is confirmed by the performance of L-BFGS, which also uses approximate second-order information. [2]

Although there is no direct relationship between F scores and log-likelihood, in these experiments F score tends to follow log-likelihood. Indeed, Figure 3 shows that preconditioned CG training improves test F scores much more rapidly than GIS training.

Table 3 compares run times (in minutes) for reaching a target penalized log-likelihood for various training methods with prior σ = 1.0. GIS is the only method that failed to reach the target, after 3,700 iterations. We cannot place the voted perceptron in this table, as it does not optimize log-likelihood and does not use a prior. However, it reaches a fairly good F score above 93% in just two training sweeps, but after that it improves more slowly, to a somewhat lower score, than preconditioned CG training.

[2] Although L-BFGS has a slightly higher penalized log-likelihood, its log-likelihood on the data is actually lower than that of preconditioned CG and mixed CG training.

5.3 Labeling Accuracy

The accuracy rate for individual labeling decisions is over-optimistic as an accuracy measure for shallow parsing. For instance, if the chunk BIIIIIII is labeled as OIIIIIII, the labeling accuracy is 87.5%, but recall is 0.
Figure 2: Training convergence for various methods. (a) L'_λ for preconditioned CG, mixed CG, and L-BFGS; (b) L'_λ for preconditioned CG, plain CG, and GIS. Both panels plot penalized log-likelihood against the number of forward-backward evaluations.

However, individual labeling errors provide a more convenient basis for statistical significance tests. One such test is the McNemar test on paired observations (Gillick and Cox, 1989).

With McNemar's test, we compare the correctness of the labeling decisions of two models. The null hypothesis is that the disagreements (correct vs. incorrect) are due to chance. Table 4 summarizes the results of tests between the models for which we had labeling decisions. These tests suggest that MEMMs are significantly less accurate, but that there are no significant differences in accuracy among the other models.
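A sketch of such a test on paired labeling decisions, using the exact binomial form of McNemar's test (one common variant; the exact form used in the paper is not specified, so this is an assumption).

```python
# Sketch of McNemar's test on paired per-token labeling decisions.
# b = tokens model A labels correctly and model B incorrectly, c = the reverse;
# under the null hypothesis the b vs. c split is Binomial(b + c, 1/2).
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """correct_a / correct_b: per-token booleans, True if that model's label
    matches the gold label.  Returns a two-sided exact p-value."""
    b = sum(1 for ca, cb in zip(correct_a, correct_b) if ca and not cb)
    c = sum(1 for ca, cb in zip(correct_a, correct_b) if cb and not ca)
    n, k = b + c, min(b, c)
    # probability of a disagreement split at least this extreme, both tails
    p = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** (n - 1)
    return min(1.0, p)
```

A small p-value means the two models' error patterns differ more than chance disagreement would explain, as in the CRF vs. MEMM row of Table 4.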

Figure 3: Test F scores vs. training time. Comparison of CG methods to GIS: test F score against the number of forward-backward evaluations for preconditioned CG, CG without preconditioner, and GIS.

6 Conclusions

We have shown that (log-)linear sequence labeling models trained discriminatively with general-purpose optimization methods are a simple, competitive solution to learning shallow parsers. These models combine the best features of generative finite-state models and discriminative (log-)linear classifiers, and do NP chunking as well as or better than "ad hoc" classifier combinations, which were the most accurate approach until now. In a longer version of this work we will also describe shallow parsing results for other phrase types. There is no reason why the same techniques cannot be used equally successfully for the other types or for other related tasks, such as POS tagging or named-entity recognition.

On the machine-learning side, it would be interesting to generalize the ideas of large-margin classification to sequence models, strengthening the results of Collins (2002) and leading to new optimal training algorithms with stronger guarantees against overfitting.

On the application side, (log-)linear parsing models have the potential to supplant the currently dominant lexicalized PCFG models for parsing by allowing much richer feature sets and simpler smoothing, while avoiding the label bias problem that may have hindered earlier classifier-based parsers (Ratnaparkhi, 1997). However, work in that direction has so far addressed only parse reranking (Collins and Duffy, 2002; Riezler et al., 2002). Full discriminative parser training faces significant algorithmic challenges in the relationship between parsing alternatives and feature values (Geman and Johnson, 2002) and in computing feature expectations.

Acknowledgments

John Lafferty and Andrew McCallum worked with the second author on developing CRFs. McCallum, helped by the second author, implemented the first conjugate-gradient trainer for CRFs, which convinced us that training of large CRFs on large datasets would be practical. Michael Collins helped us reproduce his generalized perceptron results and compare his method with ours. Erik Tjong Kim Sang, who has created the best online resources on shallow parsing, helped us with details of the CoNLL-2000 shared task. Taku Kudo provided the output of his SVM chunker for the significance test.

References

S. Abney. Parsing by chunks. In R. Berwick, S. Abney, and C. Tenny, editors, Principle-based Parsing. Kluwer Academic Publishers, 1991.
S. Abney, R. E. Schapire, and Y. Singer. Boosting applied to tagging and PP attachment. In Proc. EMNLP-VLC, New Brunswick, New Jersey, 1999. ACL.
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 1996.
D. M. Bikel, R. L. Schwartz, and R. M. Weischedel. An algorithm that learns what's in a name. Machine Learning, 34:211–231, 1999.
L. Bottou. Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, 1991.
E. Brill. Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics, 21:543–565, 1995.
S. F. Chen and R. Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.
M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP 2002. ACL, 2002.
M. Collins and N. Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proc. 40th ACL, 2002.
J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE PAMI, 19(4):380–393, 1997.
B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.
D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. In Proc. AAAI 2000, 2000.
S. Geman and M. Johnson. Dynamic programming for parsing and estimation of stochastic unification-based grammars. In Proc. 40th ACL, 2002.
L. Gillick and S. Cox. Some statistical issues in the comparison of speech recognition algorithms. In International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 532–535, 1989.
J. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
T. Kudo and Y. Matsumoto. Chunking with support vector machines. In Proc. NAACL 2001. ACL, 2001.
J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242, 1992.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML-01, pages 282–289, 2001.
R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proc. CoNLL-2002, 2002.
A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, pages 591–598, Stanford, California, 2000.
T. P. Minka. Algorithms for maximum-likelihood logistic regression. Technical Report 758, CMU Statistics Department, 2001.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS 13, pages 995–1001. MIT Press, 2001.
L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In Proc. Third Workshop on Very Large Corpora. ACL, 1995.
A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Proc. EMNLP, New Brunswick, New Jersey, 1996. ACL.
A. Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In C. Cardie and R. Weischedel, editors, EMNLP-2. ACL, 1997.
S. Riezler, T. H. King, R. M. Kaplan, R. Crouch, J. T. Maxwell III, and M. Johnson. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proc. 40th ACL, 2002.
E. F. T. K. Sang. Memory-based shallow parsing. Journal of Machine Learning Research, 2:559–594, 2002.
J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994. URL http://www-2.cs.cmu.edu/~jrs/jrspapers.html#cg.
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002.
E. F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. CoNLL-2000, pages 127–132, 2000.
H. Wallach. Efficient training of conditional random fields. In Proc. 6th Annual CLUK Research Colloquium, 2002.
A. Yeh. More accurate tests for the statistical significance of result differences. In COLING-2000, pages 947–953, Saarbruecken, Germany, 2000.
T. Zhang, F. Damerau, and D. Johnson. Text chunking based on a generalization of winnow. Journal of Machine Learning Research, 2:615–637, 2002.