Proceedings of HLT-NAACL 2003, Main Papers, pp. 134-141, Edmonton, May-June 2003

Shallow Parsing with Conditional Random Fields

Fei Sha and Fernando Pereira
Department of Computer and Information Science, University of Pennsylvania
200 South 33rd Street, Philadelphia, PA 19104
(feisha|pereira)@cis.upenn.edu

Abstract

Conditional random fields for sequence labeling offer advantages over both generative models like HMMs and classifiers applied at each sequence position. Among sequence labeling tasks in language processing, shallow parsing has received much attention, with the development of standard evaluation datasets and extensive comparison among methods. We show here how to train a conditional random field to achieve performance as good as any reported base noun-phrase chunking method on the CoNLL task, and better than any reported single model. Improved training methods based on modern optimization algorithms were critical in achieving these results. We present extensive comparisons between models and training methods that confirm and strengthen previous results on shallow parsing and training methods for maximum-entropy models.

1 Introduction

Sequence analysis tasks in language and biology are often described as mappings from input sequences to sequences of labels encoding the analysis. In language processing, examples of such tasks include part-of-speech tagging, named-entity recognition, and the task we shall focus on here, shallow parsing. Shallow parsing identifies the non-recursive cores of various phrase types in text, possibly as a precursor to full parsing or information extraction (Abney, 1991). The paradigmatic shallow-parsing problem is NP chunking, which finds the non-recursive cores of noun phrases called base NPs. The pioneering work of Ramshaw and Marcus (1995) introduced NP chunking as a machine-learning problem, with standard datasets and evaluation metrics. The task was extended to additional phrase types for the CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000), which is now the standard evaluation task for shallow parsing.

Most previous work used two main machine-learning approaches to sequence labeling. The first approach relies on k-order generative probabilistic models of paired input sequences and label sequences, for instance hidden Markov models (HMMs) (Freitag and McCallum, 2000; Kupiec, 1992) or multilevel Markov models (Bikel et al., 1999). The second approach views the sequence labeling problem as a sequence of classification problems, one for each of the labels in the sequence. The classification result at each position may depend on the whole input and on the previous k classifications. [1]

The generative approach provides well-understood training and decoding algorithms for HMMs and more general graphical models. However, effective generative models require stringent conditional independence assumptions. For instance, it is not practical to make the label at a given position depend on a window on the input sequence as well as the surrounding labels, since the inference problem for the corresponding graphical model would be intractable. Non-independent features of the inputs, such as capitalization, suffixes, and surrounding words, are important in dealing with words unseen in training, but they are difficult to represent in generative models.

The sequential classification approach can handle many correlated features, as demonstrated in work on maximum-entropy models (McCallum et al., 2000; Ratnaparkhi, 1996) and a variety of other linear classifiers, including winnow (Punyakanok and Roth, 2001), AdaBoost (Abney et al., 1999), and support-vector machines (Kudo and Matsumoto, 2001). Furthermore, they are trained to minimize some function related to labeling error, leading to smaller error in practice if enough training data are available. In contrast, generative models are trained to maximize the joint probability of the training data, which is not as closely tied to the accuracy metrics of interest if the actual data was not generated by the model, as is always the case in practice.

However, since sequential classifiers are trained to make the best local decision, unlike generative models they cannot trade off decisions at different positions against each other. In other words, sequential classifiers are myopic about the impact of their current decision on later decisions (Bottou, 1991; Lafferty et al., 2001). This forced the best sequential classifier systems to resort to heuristic combinations of forward-moving and backward-moving sequential classifiers (Kudo and Matsumoto, 2001).

Conditional random fields (CRFs) bring together the best of generative and classification models. Like classification models, they can accommodate many statistically correlated features of the inputs, and they are trained discriminatively. But like generative models, they can trade off decisions at different sequence positions to obtain a globally optimal labeling. Lafferty et al. (2001) showed that CRFs beat related classification models as well as HMMs on synthetic data and on a part-of-speech tagging task.

In the present work, we show that CRFs beat all reported single-model NP chunking results on the standard evaluation dataset, and are statistically indistinguishable from the previous best performer, a voting arrangement of 24 forward- and backward-looking support-vector classifiers (Kudo and Matsumoto, 2001). To obtain these results, we had to abandon the original iterative scaling CRF training algorithm for convex optimization algorithms with better convergence properties. We provide detailed comparisons between training methods.

The generalized perceptron proposed by Collins (2002) is closely related to CRFs, but the best CRF training methods seem to have a slight edge over the generalized perceptron.

[1] Ramshaw and Marcus (1995) used transformation-based learning (Brill, 1995), which for the present purposes can be thought of as a classification-based method.
2 Conditional Random Fields

We focus here on conditional random fields on sequences, although the notion can be used more generally (Lafferty et al., 2001; Taskar et al., 2002). Such CRFs define conditional probability distributions p(Y|X) of label sequences given input sequences. We assume that the random variable sequences X and Y have the same length, and use x = x_1 ... x_n and y = y_1 ... y_n for the generic input sequence and label sequence, respectively.

A CRF on (X, Y) is specified by a vector f of local features and a corresponding weight vector λ. Each local feature is either a state feature s(y, x, i) or a transition feature t(y, y', x, i), where y, y' are labels, x an input sequence, and i an input position. To make the notation more uniform, we also write

    s(y, y', x, i) = s(y', x, i)
    s(y, x, i) = s(y_i, x, i)
    t(y, x, i) = t(y_{i-1}, y_i, x, i)  if i > 1,  and 0 if i = 1

for any state feature s and transition feature t. Typically, features depend on the inputs around the given position, although they may also depend on global properties of the input, or be non-zero only at some positions, for instance features that pick out the first or last labels.

The CRF's global feature vector for input sequence x and label sequence y is given by

    F(y, x) = Σ_i f(y, x, i)

where i ranges over input positions. The conditional probability distribution defined by the CRF is then

    p_λ(Y | X) = exp(λ · F(Y, X)) / Z_λ(X)        (1)

where

    Z_λ(x) = Σ_y exp(λ · F(y, x))

Any positive conditional distribution p(Y|X) that obeys the Markov property

    p(Y_i | {Y_j}_{j≠i}, X) = p(Y_i | Y_{i-1}, Y_{i+1}, X)

can be written in the form (1) for appropriate choice of feature functions and weight vector (Hammersley and Clifford, 1971).

The most probable label sequence for input sequence x is

    ŷ = arg max_y p_λ(y | x) = arg max_y λ · F(y, x)

because Z_λ(x) does not depend on y. F(y, x) decomposes into a sum of terms for consecutive pairs of labels, so the most likely y can be found with the Viterbi algorithm.
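Since the score λ · F(y, x) is a sum of per-position terms, decoding reduces to the standard Viterbi recursion over one K x K score matrix per position. The sketch below is an illustration under assumed conventions (scores are precomputed, and row 0 of the first matrix acts as a start state); it is not the authors' Java implementation.

```python
# Minimal sketch of Viterbi decoding for a linear-chain CRF.  logM[i][a, b]
# holds the score lambda . f(a, b, x, i) of moving from label a to label b at
# position i; row 0 of logM[0] supplies the scores for the first position.
import numpy as np

def viterbi(logM):
    n = len(logM)
    K = logM[0].shape[1]
    delta = logM[0][0, :].copy()          # best score ending in each label
    back = np.zeros((n, K), dtype=int)    # back-pointers (back[0] unused)
    for i in range(1, n):
        cand = delta[:, None] + logM[i]   # cand[a, b] = best-to-a + score(a -> b)
        back[i] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    # Follow back-pointers from the best final label
    y = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return list(reversed(y))
```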
We train a CRF by maximizing the log-likelihood of a given training set T = {(x_k, y_k)}_{k=1..N}, which we assume fixed for the rest of this section:

    L_λ = Σ_k log p_λ(y_k | x_k)
        = Σ_k [λ · F(y_k, x_k) - log Z_λ(x_k)]

To perform this optimization, we seek the zero of the gradient

    ∇L_λ = Σ_k [F(y_k, x_k) - E_{p_λ(Y|x_k)} F(Y, x_k)]        (2)

In words, the maximum of the training data likelihood is reached when the empirical average of the global feature vector equals its model expectation.
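For very short sequences, L_λ and the gradient in Eq. (2) can be checked by brute-force enumeration, which makes this moment-matching condition concrete; the efficient forward-backward computation is given next. The helper names and the feature interface below are illustrative assumptions, not part of the paper.

```python
# Brute-force L_lambda and Eq. (2) on a tiny example, enumerating all label
# sequences.  Only feasible for short sequences and small label sets; the
# forward-backward recursions below are the practical alternative.
import itertools
import numpy as np

def global_F(y, x, f, D):
    """Sum the local feature vector f(y_{i-1}, y_i, x, i) over positions.
    f returns a length-D numpy array; y_{i-1} is passed as None at i = 0."""
    F = np.zeros(D)
    for i in range(len(x)):
        F += f(y[i - 1] if i > 0 else None, y[i], x, i)
    return F

def loglik_and_grad(lam, x, y_obs, f, D, labels):
    seqs = list(itertools.product(labels, repeat=len(x)))
    Fs = np.array([global_F(y, x, f, D) for y in seqs])
    scores = Fs @ lam
    logZ = np.logaddexp.reduce(scores)
    p = np.exp(scores - logZ)                      # p_lambda(y | x) for every y
    F_obs = global_F(y_obs, x, f, D)
    expected_F = p @ Fs                            # E_{p_lambda(Y|x)} F(Y, x)
    return F_obs @ lam - logZ, F_obs - expected_F  # L_lambda term, Eq. (2) term
```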

The expectation E_{p_λ(Y|x)} F(Y, x) can be computed efficiently using a variant of the forward-backward algorithm. For a given x, define the transition matrix for position i as

    M_i[y, y'] = exp(λ · f(y, y', x, i))

Let f be any local feature, f_i[y, y'] = f(y, y', x, i), F(y, x) = Σ_i f(y_{i-1}, y_i, x, i), and let ∗ denote component-wise matrix product. Then

    E_{p_λ(Y|x)} F(Y, x) = Σ_y p_λ(y | x) F(y, x)
                         = Σ_i α_{i-1} (f_i ∗ M_i) β_i^T / Z_λ(x)

    Z_λ(x) = α_n · 1^T

where α_i and β_i are the forward and backward state-cost vectors defined by

    α_i = α_{i-1} M_i   if 0 < i ≤ n,   and 1 if i = 0
    β_i^T = M_{i+1} β_{i+1}^T   if 1 ≤ i < n,   and 1 if i = n

Therefore, we can use a forward pass to compute the α_i and a backward pass to compute the β_i and accumulate the feature expectations.
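A direct transcription of these recursions, in unnormalized form (real implementations rescale or work in log space to avoid overflow); the arrays M, feat and the start-state convention are assumptions carried over from the earlier sketches, not the authors' code.

```python
# Sketch of the alpha/beta recursions and the feature-expectation sum above.
# M[i] is the (K, K) matrix M_{i+1}; feat[i][a, b] is a (D,)-vector of local
# feature values f(a, b, x, i+1).  Index 0 of the label set acts as the start
# state, mirroring the Viterbi sketch.
import numpy as np

def expected_features(M, feat, D):
    n, K = len(M), M[0].shape[1]
    alpha = np.zeros((n + 1, K))
    beta = np.ones((n + 1, K))
    alpha[0, 0] = 1.0                        # start state
    for i in range(1, n + 1):
        alpha[i] = alpha[i - 1] @ M[i - 1]   # alpha_i = alpha_{i-1} M_i
    for i in range(n - 1, 0, -1):
        beta[i] = M[i] @ beta[i + 1]         # beta_i^T = M_{i+1} beta_{i+1}^T
    Z = alpha[n].sum()                       # Z_lambda(x) = alpha_n . 1^T
    E = np.zeros(D)
    for i in range(1, n + 1):
        # P(Y_{i-1}=a, Y_i=b | x) = alpha_{i-1}[a] M_i[a, b] beta_i[b] / Z
        P = (alpha[i - 1][:, None] * M[i - 1]) * beta[i][None, :] / Z
        E += np.einsum('ab,abd->d', P, feat[i - 1])
    return E, Z
```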

To avoid overfitting, we penalize the likelihood with a spherical Gaussian weight prior (Chen and Rosenfeld, 1999):

    L'_λ = Σ_k [λ · F(y_k, x_k) - log Z_λ(x_k)] - ||λ||^2 / (2σ^2) + const

with gradient

    ∇L'_λ = Σ_k [F(y_k, x_k) - E_{p_λ(Y|x_k)} F(Y, x_k)] - λ / σ^2

3 Training Methods

Lafferty et al. (2001) used iterative scaling algorithms for CRF training, following earlier work on maximum-entropy models for natural language (Berger et al., 1996; Della Pietra et al., 1997). Those methods are very simple and guaranteed to converge, but as Minka (2001) and Malouf (2002) showed for classification, their convergence is much slower than that of general-purpose convex optimization algorithms when many correlated features are involved. Concurrently with the present work, Wallach (2002) tested conjugate gradient and second-order methods for CRF training, showing significant training speed advantages over iterative scaling on a small shallow parsing problem. Our work shows that preconditioned conjugate-gradient (CG) (Shewchuk, 1994) or limited-memory quasi-Newton (L-BFGS) (Nocedal and Wright, 1999) perform comparably on very large problems (around 3.8 million features). We compare those algorithms to generalized iterative scaling (GIS) (Darroch and Ratcliff, 1972), non-preconditioned CG, and voted perceptron training (Collins, 2002). All algorithms except voted perceptron maximize the penalized log-likelihood: λ* = arg max_λ L'_λ. However, for ease of exposition, this discussion of training methods uses the unpenalized log-likelihood L_λ.

3.1 Preconditioned Conjugate Gradient

Conjugate-gradient (CG) methods have been shown to be very effective in linear and non-linear optimization (Shewchuk, 1994). Instead of searching along the gradient, conjugate gradient searches along a carefully chosen linear combination of the gradient and the previous search direction.

CG methods can be accelerated by linearly transforming the variables with a preconditioner (Nocedal and Wright, 1999; Shewchuk, 1994). The purpose of the preconditioner is to improve the condition number of the quadratic form that locally approximates the objective function, so the inverse of the Hessian is a reasonable preconditioner. However, this is not applicable to CRFs for two reasons. First, the size of the Hessian is dim(λ)^2, leading to unacceptable space and time requirements for the inversion. In such situations, it is common to use instead the (inverse of the) diagonal of the Hessian. However, in our case the Hessian has the form

    H_λ := ∇^2 L_λ = - Σ_k { E[F(Y, x_k) ⊗ F(Y, x_k)] - E F(Y, x_k) ⊗ E F(Y, x_k) }

where the expectations are taken with respect to p_λ(Y|x_k). Therefore, every Hessian element, including the diagonal ones, involves the expectation of a product of global feature values. Unfortunately, computing those expectations is quadratic on sequence length, as the forward-backward algorithm can only compute expectations of quantities that are additive along label sequences.

We solve both problems by discarding the off-diagonal terms and approximating the expectation of the square of a global feature by the expectation of the sum of squares of the corresponding local features at each position. The approximated diagonal term H_f for feature f has the form

    H_f = E_{p_λ(Y|x_k)} [ Σ_i f(Y_{i-1}, Y_i, x_k, i)^2 ] - ( E_{p_λ(Y|x_k)} f(Y, x_k) )^2

where the first expectation is accumulated position by position from the pairwise marginals α_{i-1}[y] M_i[y, y'] β_i[y'] / Z_λ(x), as in the previous section. If this approximation is semidefinite, which is trivial to check, its inverse is an excellent preconditioner for early iterations of CG training. However, when the model is close to the maximum, the approximation becomes unstable, which is not surprising since it is based on feature independence assumptions that become invalid as the weights of interaction features move away from zero. Therefore, we disable the preconditioner after a certain number of iterations, determined from held-out data. We call this strategy mixed CG training.

3.2 Limited-Memory Quasi-Newton

Newton methods for nonlinear optimization use second-order (curvature) information to find search directions. As discussed in the previous section, it is not practical to obtain exact curvature information for CRF training. Limited-memory BFGS (L-BFGS) is a second-order method that estimates the curvature numerically from previous gradients and updates, avoiding the need for an exact Hessian inverse computation. Compared with preconditioned CG, L-BFGS can also handle large-scale problems but does not require specialized Hessian approximations. An earlier study indicates that L-BFGS performs well in maximum-entropy classifier training (Malouf, 2002).

There is no theoretical guidance on how much information from previous steps we should keep to obtain sufficiently accurate curvature estimates. In our experiments, storing 3 to 10 pairs of previous gradients and updates worked well, so the extra memory required over preconditioned CG was modest. A more detailed description of this method can be found elsewhere (Nocedal and Wright, 1999).
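As a concrete illustration of how the penalized objective L'_λ and its gradient plug into a limited-memory quasi-Newton routine, here is a minimal sketch using SciPy's L-BFGS-B implementation as a stand-in for the trainer described in the text (which was a custom Java implementation). The helper loglik_and_grad_all, the dimension argument, and the defaults are illustrative assumptions.

```python
# Sketch: maximize the penalized log-likelihood L'_lambda with an off-the-shelf
# limited-memory quasi-Newton routine.  `loglik_and_grad_all` is a hypothetical
# helper that sums the per-sentence log-likelihood and gradient computed with
# the forward-backward recursions above.
import numpy as np
from scipy.optimize import minimize

def train_lbfgs(loglik_and_grad_all, dim, sigma=1.0, m=5, max_iter=200):
    def objective(lam):
        L, g = loglik_and_grad_all(lam)            # unpenalized L_lambda, grad
        L -= lam @ lam / (2.0 * sigma ** 2)        # Gaussian prior penalty
        g = g - lam / sigma ** 2
        return -L, -g                              # minimize the negative

    result = minimize(objective, np.zeros(dim), jac=True,
                      method='L-BFGS-B',
                      options={'maxcor': m, 'maxiter': max_iter})
    return result.x
```

The 'maxcor' option corresponds to the number of stored gradient/update pairs, mirroring the 3 to 10 pairs mentioned above.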
3.3 Voted Perceptron

Unlike other methods discussed so far, voted perceptron training (Collins, 2002) attempts to minimize the difference between the global feature vector for a training instance and the same feature vector for the best-scoring labeling of that instance according to the current model. More precisely, for each training instance the method computes a weight update

    λ_{t+1} = λ_t + F(y_k, x_k) - F(ŷ_k, x_k)        (3)

in which ŷ_k is the Viterbi path

    ŷ_k = arg max_y λ_t · F(y, x_k)

Like the familiar perceptron algorithm, this algorithm repeatedly sweeps over the training instances, updating the weight vector as it considers each instance. Instead of taking just the final weight vector, the voted perceptron algorithm takes the average of the λ_t. Collins (2002) reported, and we confirmed, that this averaging reduces overfitting considerably.
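The update in Eq. (3), together with the averaging step, can be sketched as follows. This is an illustrative reconstruction, not Collins' or the authors' implementation; decode and global_F stand for the Viterbi decoder and global feature vector from the earlier sketches.

```python
# Sketch of voted (averaged) perceptron training, Eq. (3).  `decode(lam, x)`
# stands for Viterbi decoding under weights lam, and `global_F(y, x)` for the
# global feature vector; `train` yields (x, y) pairs with y a list of labels.
import numpy as np

def averaged_perceptron(train, decode, global_F, dim, sweeps=10):
    lam = np.zeros(dim)
    lam_sum = np.zeros(dim)   # running sum of the lambda_t, for the average
    updates = 0
    for _ in range(sweeps):
        for x, y in train:
            y_hat = decode(lam, x)                       # best path under lam
            if y_hat != y:
                lam = lam + global_F(y, x) - global_F(y_hat, x)
            lam_sum += lam
            updates += 1
    return lam_sum / updates  # averaged weights, which reduce overfitting
```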
4 Shallow Parsing

Figure 1 shows the base NPs in an example sentence. Following Ramshaw and Marcus (1995), the input to the NP chunker consists of the words in a sentence annotated automatically with part-of-speech (POS) tags. The chunker's task is to label each word with a label indicating whether the word is outside a chunk (O), starts a chunk (B), or continues a chunk (I). For example, the tokens in the first line of Figure 1 would be labeled BIIBIIOBOBIIO.

4.1 Data Preparation

NP chunking results have been reported on two slightly different data sets: the original RM data set of Ramshaw and Marcus (1995), and the modified CoNLL-2000 version of Tjong Kim Sang and Buchholz (2000). Although the chunk tags in the RM and CoNLL-2000 data are somewhat different, we found no significant accuracy differences between models trained on these two data sets. Therefore, all our results are reported on the CoNLL-2000 data set. We also used a development test set, provided by Michael Collins, derived from WSJ section 21 tagged with the Brill (1995) POS tagger.

4.2 CRFs for Shallow Parsing

Our chunking CRFs have a second-order Markov dependency between chunk tags. This is easily encoded by making the CRF labels pairs of consecutive chunk tags. That is, the label at position i is y_i = c_{i-1}c_i, where c_i is the chunk tag of word i, one of O, B, or I. Since B must be used to start a chunk, the label OI is impossible. In addition, successive labels are constrained: y_{i-1} = c_{i-2}c_{i-1}, y_i = c_{i-1}c_i, and c_0 = O. These constraints on the model topology are enforced by giving appropriate features a weight of -∞, forcing all the forbidden labelings to have zero probability.
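A small sketch of this label-pair encoding and its topology constraints, under the conventions just described; all names are illustrative.

```python
# Sketch of the second-order label encoding: each CRF label is a pair of
# consecutive chunk tags c_{i-1} c_i, with OI forbidden and consecutive labels
# required to agree on the shared chunk tag.
from itertools import product

CHUNK_TAGS = ['O', 'B', 'I']
LABELS = [a + b for a, b in product(CHUNK_TAGS, repeat=2) if a + b != 'OI']

def allowed(prev_label, label):
    """A transition prev_label -> label is consistent iff the second tag of
    prev_label equals the first tag of label."""
    return prev_label[1] == label[0]

def encode(chunk_tags):
    """Map chunk tags c_1..c_n to labels y_i = c_{i-1}c_i, with c_0 = O."""
    return [('O' if i == 0 else chunk_tags[i - 1]) + chunk_tags[i]
            for i in range(len(chunk_tags))]

# Example: the first line of Figure 1 has chunk tags B I I B I I O B O B I I O,
# which encode to OB, BI, II, IB, BI, II, IO, OB, BO, OB, BI, II, IO.
print(encode(list('BIIBIIOBOBIIO')))
```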

Figure 1: NP chunks. Example sentence: Rockwell International Corp. 's Tulsa unit said it signed a tentative agreement extending its contract with Boeing Co. to provide structural parts for Boeing 's 747 jetliners .

Our choice of features was mainly governed by computing power, since we do not use feature selection and all features are used in training and testing. We use the following factored representation for features:

    f(y_{i-1}, y_i, x, i) = p(x, i) q(y_{i-1}, y_i)        (4)

where p(x, i) is a predicate on the input sequence x and current position i and q(y_{i-1}, y_i) is a predicate on pairs of labels. For instance, p(x, i) might be "word at position i is the" or "the POS tags at positions i - 1, i are DT, NN." Because the label set is finite, such a factoring of f(y_{i-1}, y_i, x, i) is always possible, and it allows each input predicate to be evaluated just once for many features that use it, making it possible to work with millions of features on large training sets.

Table 1 summarizes the feature set. For a given position i, w_i is the word, t_i its POS tag, and y_i its label. For any label y = c'c, c(y) = c is the corresponding chunk tag. For example, c(OB) = B. The use of chunk tags as well as labels provides a form of backoff from the very small feature counts that may arise in a second-order model, while allowing significant associations between tag pairs and input predicates to be modeled. To save time in some of our experiments, we used only the 820,000 features that are supported in the CoNLL training set, that is, the features that are on at least once. For our highest F score, we used the complete feature set, around 3.8 million in the CoNLL training set, which contains all the features whose predicate is on at least once in the training set. The complete feature set may in principle perform better because it can place negative weights on transitions that should be discouraged if a given predicate is on.

Table 1: Shallow parsing features

    q(y_{i-1}, y_i)                  p(x, i)
    y_i = y                          true
    y_i = y, y_{i-1} = y'
    c(y_i) = c
    -------------------------------------------------------
    y_i = y                          w_i = w or w_{i-1} = w
    c(y_i) = c                       w_{i+1} = w
                                     w_{i-2} = w
                                     w_{i+2} = w
                                     w_{i-1} = w', w_i = w
                                     w_{i+1} = w', w_i = w
                                     t_i = t
                                     t_{i-1} = t
                                     t_{i+1} = t
                                     t_{i-2} = t
                                     t_{i+2} = t
                                     t_{i-1} = t', t_i = t
                                     t_{i-2} = t', t_{i-1} = t
                                     t_i = t', t_{i+1} = t
                                     t_{i+1} = t', t_{i+2} = t
                                     t_{i-2} = t'', t_{i-1} = t', t_i = t
                                     t_{i-1} = t'', t_i = t', t_{i+1} = t
                                     t_i = t'', t_{i+1} = t', t_{i+2} = t

4.3 Parameter Tuning

As discussed previously, we need a Gaussian weight prior to reduce overfitting. We also need to choose the number of training iterations, since we found that the best F score is attained while the log-likelihood is still improving. The reasons for this are not clear, but the Gaussian prior may not be enough to keep the optimization from making weight adjustments that slightly improve training log-likelihood but cause large F score fluctuations. We used the development test set mentioned in Section 4.1 to set the prior and the number of iterations.

4.4 Evaluation Metric

The standard evaluation metrics for a chunker are precision P (fraction of output chunks that exactly match the reference chunks), recall R (fraction of reference chunks returned by the chunker), and their harmonic mean, the F1 score F1 = 2PR / (P + R) (which we call just F score in what follows). The relationships between F score and labeling error or log-likelihood are not direct, so we report both F score and the other metrics for the models we tested. For comparisons with other reported results we use F score.
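A minimal sketch of these chunk-level metrics, assuming the B/I/O conventions of Section 4 (every chunk starts with B); the official CoNLL-2000 evaluation script, not this sketch, is the reference implementation.

```python
# Sketch of the chunk-level metrics in Section 4.4: precision, recall, and F1
# computed from BIO chunk-tag sequences.
def chunk_spans(tags):
    """Spans (start, end_exclusive) of chunks in a BIO sequence where every
    chunk starts with B and continues with I."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t != 'I' and start is not None:      # current chunk ends before i
            spans.append((start, i))
            start = None
        if t == 'B':
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return set(spans)

def prf(gold_tags, pred_tags):
    gold, pred = chunk_spans(gold_tags), chunk_spans(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example from Section 5.3: BIIIIIII mislabeled as OIIIIIII has recall 0,
# even though 7 of the 8 individual labels are correct.
print(prf(list('BIIIIIII'), list('OIIIIIII')))   # -> (0.0, 0.0, 0.0)
```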
4.5 Significance Tests

Ideally, comparisons among chunkers would control for feature sets, data preparation, training and test procedures, and parameter tuning, and estimate the statistical significance of performance differences. Unfortunately, reported results sometimes leave out details needed for accurate comparisons. We report F scores for comparison with previous work, but we also give statistical significance estimates using McNemar's test for those methods that we evaluated directly.

Testing the significance of F scores is tricky because the wrong chunks generated by two chunkers are not directly comparable. Yeh (2000) examined randomized tests for estimating the significance of F scores, and in particular the bootstrap over the test set (Efron and Tibshirani, 1993; Sang, 2002). However, bootstrap variances in preliminary experiments were too high to allow any conclusions, so we used instead a McNemar paired test on labeling disagreements (Gillick and Cox, 1989).
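For concreteness, the bootstrap over the test set mentioned above can be sketched as follows; chunk_spans is assumed from the earlier metric sketch, and this is only an illustration of the idea, not the exact procedure used in the preliminary experiments.

```python
# Sketch of the bootstrap over the test set: resample test sentences with
# replacement and examine the distribution of the F-score difference between
# two chunkers.
import random

def corpus_f(gold_seqs, pred_seqs):
    correct = guessed = reference = 0
    for g, p in zip(gold_seqs, pred_seqs):
        gs, ps = chunk_spans(g), chunk_spans(p)
        correct += len(gs & ps)
        guessed += len(ps)
        reference += len(gs)
    prec = correct / guessed if guessed else 0.0
    rec = correct / reference if reference else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def bootstrap_f_diff(gold, pred_a, pred_b, samples=1000, seed=0):
    rng = random.Random(seed)
    n, diffs = len(gold), []
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(corpus_f([gold[i] for i in idx], [pred_a[i] for i in idx])
                     - corpus_f([gold[i] for i in idx], [pred_b[i] for i in idx]))
    return diffs   # inspect the spread or the fraction of sign changes
```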

Table 2: NP chunking F scores

    Model                                        F score
    SVM combination (Kudo and Matsumoto, 2001)   94.39%
    CRF                                          94.38%
    Generalized winnow (Zhang et al., 2002)      93.89%
    Voted perceptron                             94.09%
    MEMM                                         93.70%

Table 3: Runtime for various training methods

    training method   time    F score   L'_λ
    Precond. CG       130     94.19%    -2968
    Mixed CG          540     94.20%    -2990
    Plain CG          648     94.04%    -2967
    L-BFGS            84      94.19%    -2948
    GIS               3700    93.55%    -5668

Table 4: McNemar's tests on labeling disagreements

    null hypothesis               p-value
    CRF vs. SVM                   0.469
    CRF vs. MEMM                  0.00109
    CRF vs. voted perceptron      0.116
    MEMM vs. voted perceptron     0.0734

5 Results

All the experiments were performed with our Java implementation of CRFs, designed to handle millions of features, on 1.7 GHz Pentium IV processors with Linux and IBM Java 1.3.0. Minor variants support voted perceptron (Collins, 2002) and MEMMs (McCallum et al., 2000) with the same efficient feature encoding. GIS, CG, and L-BFGS were used to train CRFs and MEMMs.

5.1 F Scores

Table 2 gives representative NP chunking F scores for previous work and for our best model, with the complete set of 3.8 million features. The last row of the table gives the score for an MEMM trained with the mixed CG method using an approximate preconditioner. The published F score for voted perceptron is 93.53% with a different feature set (Collins, 2002). The improved result given here is for the supported feature set; the complete feature set gives a slightly lower score of 94.07%. Zhang et al. (2002) reported a higher F score (94.38%) with generalized winnow using additional linguistic features that were not available to us.

5.2 Convergence Speed

All the results in the rest of this section are for the smaller supported set of 820,000 features. Figures 2a and 2b show how preconditioning helps training convergence. Since each CG iteration involves a line search that may require several forward-backward procedures (typically between 4 and 5 in our experiments), we plot the progress of the penalized log-likelihood L'_λ with respect to the number of forward-backward evaluations. The objective function increases rapidly, achieving close proximity to the maximum in a few iterations (typically 10). In contrast, GIS training increases L'_λ rather slowly, never reaching the value achieved by CG. The relative slowness of iterative scaling is also documented in a recent evaluation of training methods for maximum-entropy classification (Malouf, 2002). In theory, GIS would eventually converge to the L'_λ optimum, but in practice convergence may be so slow that L'_λ improvements may fall below numerical accuracy, falsely indicating convergence.

Mixed CG training converges slightly more slowly than preconditioned CG. On the other hand, CG without preconditioner converges much more slowly than both preconditioned CG and mixed CG training. However, it is still much faster than GIS. We believe that the superior convergence rate of preconditioned CG is due to the use of approximate second-order information. This is confirmed by the performance of L-BFGS, which also uses approximate second-order information. [2]

Although there is no direct relationship between F scores and log-likelihood, in these experiments F score tends to follow log-likelihood. Indeed, Figure 3 shows that preconditioned CG training improves test F scores much more rapidly than GIS training.

Table 3 compares run times (in minutes) for reaching a target penalized log-likelihood for various training methods with prior σ = 1.0. GIS is the only method that failed to reach the target, after 3,700 iterations. We cannot place the voted perceptron in this table, as it does not optimize log-likelihood and does not use a prior. However, it reaches a fairly good F score above 93% in just two training sweeps, but after that it improves more slowly, to a somewhat lower score, than preconditioned CG training.

[2] Although L-BFGS has a slightly higher penalized log-likelihood, its log-likelihood on the data is actually lower than that of preconditioned CG and mixed CG training.

5.3 Labeling Accuracy

The accuracy rate for individual labeling decisions is over-optimistic as an accuracy measure for shallow parsing. For instance, if the chunk BIIIIIII is labeled as OIIIIIII, the labeling accuracy is 87.5%, but recall is 0.
Figure 2: Training convergence for various methods. (a) L'_λ for preconditioned CG, mixed CG, and L-BFGS; (b) L'_λ for preconditioned CG, plain CG, and GIS. Both panels plot penalized log-likelihood against the number of forward-backward evaluations.

However, individual labeling errors provide a more convenient basis for statistical significance tests. One such test is the McNemar test on paired observations (Gillick and Cox, 1989).

With McNemar's test, we compare the correctness of the labeling decisions of two models. The null hypothesis is that the disagreements (correct vs. incorrect) are due to chance. Table 4 summarizes the results of tests between the models for which we had labeling decisions. These tests suggest that MEMMs are significantly less accurate, but that there are no significant differences in accuracy among the other models.
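A sketch of such a test on paired labeling decisions, using the exact binomial form of McNemar's test (one common variant; the exact form used in the paper is not specified, so this is an assumption).

```python
# Sketch of McNemar's test on paired per-token labeling decisions.
# b = tokens model A labels correctly and model B incorrectly, c = the reverse;
# under the null hypothesis the b vs. c split is Binomial(b + c, 1/2).
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """correct_a / correct_b: per-token booleans, True if that model's label
    matches the gold label.  Returns a two-sided exact p-value."""
    b = sum(1 for ca, cb in zip(correct_a, correct_b) if ca and not cb)
    c = sum(1 for ca, cb in zip(correct_a, correct_b) if cb and not ca)
    n, k = b + c, min(b, c)
    # probability of a disagreement split at least this extreme, both tails
    p = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** (n - 1)
    return min(1.0, p)
```

A small p-value means the two models' error patterns differ more than chance disagreement would explain, as in the CRF vs. MEMM row of Table 4.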

Figure 3: Test F scores vs. training time. Comparison of CG methods to GIS: test F score against the number of forward-backward evaluations for preconditioned CG, CG without preconditioner, and GIS.

6 Conclusions

We have shown that (log-)linear sequence labeling models trained discriminatively with general-purpose optimization methods are a simple, competitive solution to learning shallow parsers. These models combine the best features of generative finite-state models and discriminative (log-)linear classifiers, and do NP chunking as well as or better than "ad hoc" classifier combinations, which were the most accurate approach until now. In a longer version of this work we will also describe shallow parsing results for other phrase types. There is no reason why the same techniques cannot be used equally successfully for the other types or for other related tasks, such as POS tagging or named-entity recognition.

On the machine-learning side, it would be interesting to generalize the ideas of large-margin classification to sequence models, strengthening the results of Collins (2002) and leading to new optimal training algorithms with stronger guarantees against overfitting.

On the application side, (log-)linear parsing models have the potential to supplant the currently dominant lexicalized PCFG models for parsing by allowing much richer feature sets and simpler smoothing, while avoiding the label bias problem that may have hindered earlier classifier-based parsers (Ratnaparkhi, 1997). However, work in that direction has so far addressed only parse reranking (Collins and Duffy, 2002; Riezler et al., 2002). Full discriminative parser training faces significant algorithmic challenges in the relationship between parsing alternatives and feature values (Geman and Johnson, 2002) and in computing feature expectations.

Acknowledgments

John Lafferty and Andrew McCallum worked with the second author on developing CRFs. McCallum, helped by the second author, implemented the first conjugate-gradient trainer for CRFs, which convinced us that training of large CRFs on large datasets would be practical. Michael Collins helped us reproduce his generalized perceptron results and compare his method with ours. Erik Tjong Kim Sang, who has created the best online resources on shallow parsing, helped us with details of the CoNLL-2000 shared task. Taku Kudo provided the output of his SVM chunker for the significance test.

References

S. Abney. Parsing by chunks. In R. Berwick, S. Abney, and C. Tenny, editors, Principle-based Parsing. Kluwer Academic Publishers, 1991.
S. Abney, R. E. Schapire, and Y. Singer. Boosting applied to tagging and PP attachment. In Proc. EMNLP-VLC, New Brunswick, New Jersey, 1999. ACL.
A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 1996.
D. M. Bikel, R. L. Schwartz, and R. M. Weischedel. An algorithm that learns what's in a name. Machine Learning, 34:211–231, 1999.
L. Bottou. Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, 1991.
E. Brill. Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics, 21:543–565, 1995.
S. F. Chen and R. Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.
M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP 2002. ACL, 2002.
M. Collins and N. Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proc. 40th ACL, 2002.
J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE PAMI, 19(4):380–393, 1997.
B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.
D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. In Proc. AAAI 2000, 2000.
S. Geman and M. Johnson. Dynamic programming for parsing and estimation of stochastic unification-based grammars. In Proc. 40th ACL, 2002.
L. Gillick and S. Cox. Some statistical issues in the comparison of speech recognition algorithms. In International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 532–535, 1989.
J. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
T. Kudo and Y. Matsumoto. Chunking with support vector machines. In Proc. NAACL 2001. ACL, 2001.
J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225–242, 1992.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML-01, pages 282–289, 2001.
R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proc. CoNLL-2002, 2002.
A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, pages 591–598, Stanford, California, 2000.
T. P. Minka. Algorithms for maximum-likelihood logistic regression. Technical Report 758, CMU Statistics Department, 2001.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In NIPS 13, pages 995–1001. MIT Press, 2001.
L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In Proc. Third Workshop on Very Large Corpora. ACL, 1995.
A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Proc. EMNLP, New Brunswick, New Jersey, 1996. ACL.
A. Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In C. Cardie and R. Weischedel, editors, EMNLP-2. ACL, 1997.
S. Riezler, T. H. King, R. M. Kaplan, R. Crouch, J. T. Maxwell III, and M. Johnson. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proc. 40th ACL, 2002.
E. F. T. K. Sang. Memory-based shallow parsing. Journal of Machine Learning Research, 2:559–594, 2002.
J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994. URL http://www-2.cs.cmu.edu/~jrs/jrspapers.html#cg.
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002.
E. F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. CoNLL-2000, pages 127–132, 2000.
H. Wallach. Efficient training of conditional random fields. In Proc. 6th Annual CLUK Research Colloquium, 2002.
A. Yeh. More accurate tests for the statistical significance of result differences. In COLING-2000, pages 947–953, Saarbruecken, Germany, 2000.
T. Zhang, F. Damerau, and D. Johnson. Text chunking based on a generalization of winnow. Journal of Machine Learning Research, 2:615–637, 2002.