
Gaussian Process Classification for Segmenting and Annotating Sequences

Yasemin Altun, [email protected]
Department of Computer Science, Brown University, Providence, RI 02912, USA

Thomas Hofmann, [email protected]
Department of Computer Science, Brown University, Providence, RI 02912, USA
Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany

Alexander J. Smola, [email protected]
Group, RSISE, Australian National University, Canberra, ACT 0200, Australia

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

Abstract

Many real-world classification tasks involve the prediction of multiple, inter-dependent class labels. A prototypical case of this sort deals with the prediction of a sequence of labels for a sequence of observations. Such problems arise naturally in the context of annotating and segmenting observation sequences. This paper generalizes Gaussian Process classification to predict multiple labels by taking dependencies between neighboring labels into account. Our approach is motivated by the desire to retain rigorous probabilistic semantics, while overcoming limitations of parametric methods like Conditional Random Fields, which exhibit conceptual and computational difficulties in high-dimensional input spaces. Experiments on named entity recognition and pitch accent prediction tasks demonstrate the competitiveness of our approach.

1. Introduction

Multiclass classification refers to the problem of assigning class labels to instances, where labels belong to some finite set of elements. Often, however, the instances to be labeled do not occur in isolation, but rather in observation sequences. One is then interested in predicting the joint label configuration, i.e. the sequence of labels corresponding to a sequence of observations, using models that take possible interdependencies between label variables into account. This scenario subsumes problems of sequence segmentation and annotation, which are ubiquitous in areas such as natural language processing, speech recognition, and computational biology.

The most common approach to sequence labeling is based on Hidden Markov Models (HMMs), which define a generative probabilistic model for labeled observation sequences. In recent years, the state of the art in sequence learning has been Conditional Random Fields (CRFs), introduced by Lafferty et al. (2001). In the most general terms, CRFs define a conditional model over label sequences given an observation sequence in terms of an exponential family; they are thus a natural generalization of logistic regression to the problem of label sequence prediction. Other related work on this subject includes Maximum Entropy Markov Models (McCallum et al., 2000) and the Markovian model of Punyakanok and Roth (2000). There have also been attempts to extend other discriminative methods such as AdaBoost (Altun et al., 2003a), perceptron learning (Collins, 2002), and Support Vector Machines (SVMs) (Altun et al., 2003b; Taskar et al., 2004) to the label sequence learning problem. The latter have compared favorably to other discriminative methods, including CRFs, in experiments. Moreover, they have the conceptual advantage of being compatible with implicit data representations via kernel functions.
In this paper, we investigate the use of Gaussian Process (GP) classification (Gibbs & MacKay, 2000; Williams & Barber, 1998) for label sequences. The main motivation for pursuing this direction is to combine the best of both worlds from CRFs and SVMs. More specifically, we would like to preserve the main strength of CRFs, which we see in their rigorous probabilistic semantics. There are two important advantages of a probabilistic model. First, it is very intuitive to incorporate prior knowledge within a probabilistic framework. Second, in addition to predicting the best labels, one can compute posterior label probabilities and thus derive confidence scores for predictions. This is a valuable property, in particular for applications requiring a cascaded architecture of classifiers: confidence scores can be propagated to subsequent processing stages or used to abstain on certain predictions. The other design goal is the ability to use kernel functions in order to construct and learn in Reproducing Kernel Hilbert Spaces (RKHS), thereby overcoming the limitations of (finite-dimensional) parametric statistical models.

A second, independent objective of our work is to gain clarification with respect to two aspects on which CRFs and the SVM-based methods differ: the first aspect is the loss function (logistic loss vs. hinge loss), and the second is the mechanism used for constructing the hypothesis space (parametric vs. RKHS).

GPs are non-parametric tools for performing Bayesian inference which, like SVMs, make use of the kernel trick to work in high (possibly infinite) dimensional spaces. Like other discriminative methods, GPs predict single variables and do not take into account any dependency structure in the case of multiple label predictions. Our goal is to generalize GPs to predict label sequences. While computationally demanding, recent progress on sparse approximation methods for GPs, e.g. (Csató & Opper, 2002; Smola & Bartlett, 2000; Seeger et al., 2003; Zhu & Hastie, 2001), suggests that scalable GP label sequence learning may be an achievable goal. Exploiting the compositionality of the kernel function, we derive a gradient-based optimization method for GP sequence classification. Moreover, we present a column generation algorithm that performs a sparse approximation of the solution.

The rest of the paper is organized as follows: In Section 2, we introduce Gaussian Process classification. We then present our formulation of Gaussian Process sequence classification (GPSC) in Section 3 and describe the proposed optimization algorithms in Section 4. Finally, we report experimental results on real-world data for named entity classification and pitch accent prediction in Section 5.
2. Gaussian Process Classification

In supervised classification, we are given a training set of n labeled instances or observations (x_i, y_i) with y_i ∈ {1, . . . , m}, drawn i.i.d. from an unknown, but fixed, joint probability distribution p(x, y). We denote the training observations and labels by X = (x_1, . . . , x_n) and y = (y_1, . . . , y_n), respectively.

GP classification constructs a two-stage model for the conditional probability distribution p(y|x) by introducing an intermediate, unobserved stochastic process u ≡ (u(x, y)), where u(x, y) can be considered a compatibility measure of an observation x and a label y. Given an instantiation of the stochastic process, we assume that the conditional probability p(y|x, u) only depends on the values of u at the input x via a multinomial response model, i.e.

p(y|x, u) = p(y|u(x, \cdot)) = \frac{\exp(u(x, y))}{\sum_{y'=1}^{m} \exp(u(x, y'))}        (1)

It is furthermore assumed that the stochastic process u is a zero mean Gaussian process with covariance function C, typically a kernel function. An additional assumption typically made in multiclass GP classification is that the processes u(·, y) and u(·, y') are uncorrelated for y ≠ y' (Williams & Barber, 1998).

For notational convenience, we will identify u with the relevant restriction of u to the training patterns X and represent it as an n × m matrix. For simplicity we will (in slight abuse of notation) also think of u as a vector with multi-index (i, y). Here and below, we make extensive use of multi-indices: we put parentheses around a comma-separated list of indices to denote a multi-index, and use two comma-separated multi-indices to refer to matrix elements. Moreover, we will denote by K the kernel matrix with entries K_{(i,y),(j,y')} = C((x_i, y), (x_j, y')). Notice that under the above assumptions K has a block diagonal structure with blocks K(y) = (K_{ij}(y)), K_{ij}(y) ≡ C_y(x_i, x_j), where C_y is a class-specific covariance function.

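As an illustration of this setup, the following minimal numpy sketch (our own, not from the paper) assembles a block-diagonal kernel matrix from class-specific gram matrices and evaluates the multinomial response model of Eq. (1). The (i, y) multi-index is laid out with y varying fastest; this ordering is an arbitrary convention chosen for the example.

```python
import numpy as np

def block_diag_kernel(gram_per_class):
    """Assemble the block-diagonal kernel matrix K of Section 2 from
    class-specific gram matrices K(y)_ij = C_y(x_i, x_j).

    gram_per_class : list of m arrays, each of shape (n, n)
    Returns an (n*m, n*m) matrix indexed by the multi-index (i, y), y fastest.
    """
    n, m = gram_per_class[0].shape[0], len(gram_per_class)
    K = np.zeros((n * m, n * m))
    for y, G in enumerate(gram_per_class):
        idx = np.arange(n) * m + y          # positions of the (., y) entries
        K[np.ix_(idx, idx)] = G
    return K

def multinomial_response(u_row):
    """p(y | x, u) of Eq. (1): a softmax over the values u(x, .)."""
    z = np.exp(u_row - u_row.max())         # subtract max for numerical stability
    return z / z.sum()
```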

Following a Bayesian approach, the prediction of a label for a new observation x is obtained by computing the posterior probability distribution over labels and selecting the label that has the highest probability:

p(y|X, y, x) = \int p(y|u(x, \cdot)) \, p(u|X, y) \, du        (2)

Thus, one needs to integrate out all n · m latent variables of u. Since this is in general intractable, it is common to perform a saddle-point approximation of the integral around the optimal point estimate, which is the maximum a posteriori (MAP) estimate: p(y|X, y, x) ≈ p(y|u^{map}(x, ·)), where u^{map} = argmax_u log p(u|X, y). Exploiting the conditional independence assumptions, the posterior of u can, up to a multiplicative constant, be written as

p(u|X, y) \propto p(u) \prod_{i=1}^{n} p(y_i|u(x_i, \cdot))        (3)

Combining the GP prior over u and the conditional model in (1) yields the more specific expression

\log p(u|X, y) = \sum_{i=1}^{n} \Big[ u(x_i, y_i) - \log \sum_{y} \exp(u(x_i, y)) \Big] - \frac{1}{2} u^T K^{-1} u + \text{const.}        (4)

The Representer Theorem (Kimeldorf & Wahba, 1971) guarantees that the maximizer of (4) is of the form

u^{map}(x_i, y) = \sum_{j=1}^{n} \sum_{y'=1}^{m} \alpha_{(j,y')} K_{(i,y),(j,y')}        (5)

with suitably chosen coefficients α. In the block diagonal case, K_{(i,y),(j,y')} = 0 for y ≠ y' and this reduces to the simpler form u^{map}(x_i, y) = \sum_{j=1}^{n} \alpha_{(j,y)} C_y(x_i, x_j).

Using the representation in (5), we can rewrite the optimization problem as an objective R parameterized by α. Let e_{(i,y)} be the (i, y)-th unit vector; then \alpha^T K e_{(i,y)} = \sum_{j,y'} \alpha_{(j,y')} K_{(i,y),(j,y')} and the negative of Eq. (4) can be written as follows:

R(\alpha|X, y) = \alpha^T K \alpha - \sum_{i=1}^{n} \log p(y_i|x_i, \alpha)        (6)
              = \alpha^T K \alpha - \sum_{i=1}^{n} \alpha^T K e_{(i,y_i)} + \sum_{i=1}^{n} \log \sum_{y} \exp\big(\alpha^T K e_{(i,y)}\big)

A comparison between (6) and a similar multiclass SVM formulation (Crammer & Singer, 2001; Weston & Watkins, 1999) clarifies the connection between GP classification and SVMs. Their difference lies primarily in the utilized loss functions: logistic loss vs. hinge loss. Because the hinge loss truncates values smaller than a threshold to 0, it enforces sparseness in terms of the α parameters. This is not the case for logistic regression as well as for other choices of loss functions; several studies have therefore focused on finding sparse solutions of Eq. (6) or of similar optimization problems (Bennett et al., 2002; Girosi, 1997; Smola & Schölkopf, 2000; Zhu & Hastie, 2001).

For non-linear link functions like the one induced by Eq. (1), u^{map} cannot be found analytically and one has to resort to approximate solutions. Various approximation schemes have been studied to that end: Laplace approximation (Williams & Barber, 1998; Williams & Seeger, 2000; Zhu & Hastie, 2001), variational methods (Jaakkola & Jordan, 1996), mean field approximations (Opper & Winther, 2000), and expectation propagation (Minka, 2001; Seeger et al., 2003). Performing these methods usually involves the computation of the Hessian matrix as well as the inversion of K, an nm × nm matrix, which is not tractable for large data sets (of size n) and/or large label sets (of size m). Several techniques have been proposed to approximate K such that the inversion of the approximating matrix is tractable (cf. (Schölkopf & Smola, 2002) for references on such methods). One can also try to solve (6) using greedy optimization methods as proposed in (Bennett et al., 2002).

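For concreteness, here is a small numpy sketch of the objective R(α) in Eq. (6), written under the same (i, y) layout as the sketch above. It is a toy illustration of the formula, not the authors' implementation; a quasi-Newton or other gradient-based optimizer would be applied to a function like this.

```python
import numpy as np

def gp_multiclass_objective(alpha, K, y):
    """Negative log posterior R(alpha | X, y) of Eq. (6), up to constants.

    alpha : (n*m,) coefficient vector with multi-index (i, y), y fastest
    K     : (n*m, n*m) kernel matrix with entries K_{(i,y),(j,y')}
    y     : (n,) integer labels in {0, ..., m-1}
    """
    n = y.shape[0]
    m = K.shape[0] // n
    u = (K @ alpha).reshape(n, m)            # u[i, y'] = alpha^T K e_{(i,y')}
    quad = alpha @ K @ alpha                 # alpha^T K alpha
    u_max = u.max(axis=1)
    log_Z = u_max + np.log(np.exp(u - u_max[:, None]).sum(axis=1))
    return quad - u[np.arange(n), y].sum() + log_Z.sum()
```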

3. GP Sequence Classification (GPSC)

3.1. Sequence Labeling and GPC

In sequence classification, our goal is to learn a discriminant function for sequences, i.e. a mapping from observation sequences X = (x^1, x^2, . . . , x^t, . . . , x^T) to label sequences y = (y^1, y^2, . . . , y^t, . . . , y^T). There exists a label y^t ∈ Σ = {1, . . . , r} for every observation x^t in the sequence. Thus, we have T multiclass classification problems. Because of the sequence structure of the labels (i.e. every label y^t depends on its neighboring labels), one needs to solve these T classification problems jointly. The problem can then be considered as multiclass classification where, for an observation sequence of length l, the possible label set Y is of size m = r^l (for notational convenience we assume that all training sequences are of the same length l). We call Σ the label set of observations, or micro-label set, and Y the set of label sequences, or macro-label set.

We assume that a training set of n labeled sequences Z ≡ {(X_i, y_i) | i = 1, . . . , n} is available. Using the notation introduced in the context of GP classification, we define p(y_i|u(X_i)) as in (1), treating every macro-label as a separate label in GP multiclass classification and using the whole sequence X_i as the input.

3.2. Kernels for Labeled Sequences

The fundamental design decision is then the engineering of the kernel function k that determines the kernel matrix K. Notice that the use of a block diagonal kernel matrix is not an option in the current setting, since it would prohibit generalizing across label sequences that differ in as little as a single micro-label.

We define the kernel function for labeled sequences with respect to the feature representation. Inspired by HMMs, we use two types of features: features that capture the dependency of the micro-labels on the attributes Φ(x^s) of the observations, and features that capture the inter-dependency of micro-labels. As in other discriminative methods, Φ(x^s) can include overlapping attributes of x^s as well as attributes of observations x^t with t ≠ s. Using stationarity, the inner product between the feature vectors of two observation sequences can be stated as k = k^1 + k^2, where

k^1((X, y), (\bar{X}, \bar{y})) \equiv \sum_{s,t} [[y^s = \bar{y}^t]] \, \langle \Phi(x^s), \Phi(\bar{x}^t) \rangle        (7a)
k^2((X, y), (\bar{X}, \bar{y})) \equiv \sum_{s,t} [[y^s = \bar{y}^t \wedge y^{s+1} = \bar{y}^{t+1}]]        (7b)

k^1 couples observations in both sequences that are classified with the same micro-labels at the respective positions. k^2 simply counts the number of consecutive label pairs both label sequences have in common (irrespective of the inputs). One can generalize (7) in various ways, e.g. by using higher order terms between micro-labels in both contributions, without posing major conceptual challenges.

k is a linear kernel function for labeled sequences. It can be generalized to non-linear kernel functions for labeled sequences by replacing \langle \Phi(x^s), \Phi(\bar{x}^t) \rangle with a standard kernel function defined over input patterns.

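A direct transcription of Eq. (7) into numpy might look as follows; the linear base kernel (a plain dot product between the feature vectors Φ(x^s)) is an assumption of this sketch and can be swapped for any standard kernel, as noted above.

```python
import numpy as np

def sequence_kernel(X, y, Xbar, ybar, base_kernel=np.dot):
    """k((X, y), (Xbar, ybar)) = k1 + k2 of Eq. (7) for two labeled sequences.

    X, Xbar : sequences of per-position feature vectors Phi(x^s)
    y, ybar : integer micro-label sequences of the same lengths as X, Xbar
    """
    # k1: inner products between positions carrying identical micro-labels
    k1 = sum(base_kernel(X[s], Xbar[t])
             for s in range(len(y)) for t in range(len(ybar))
             if y[s] == ybar[t])
    # k2: number of shared consecutive micro-label pairs, input-independent
    k2 = sum(1
             for s in range(len(y) - 1) for t in range(len(ybar) - 1)
             if y[s] == ybar[t] and y[s + 1] == ybar[t + 1])
    return k1 + k2
```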

We can naively follow the same line of argumentation as in the GPC case of Section 2, invoke the Representer Theorem, and ultimately arrive at the objective in (6). Since we need it for subsequent derivations, we restate the objective here:

R(\alpha|Z) = \alpha^T K \alpha - \sum_{i=1}^{n} \alpha^T K e_{(i,y_i)} + \sum_{i=1}^{n} \log \sum_{y \in Y} \exp\big(\alpha^T K e_{(i,y)}\big)        (8)

Notice that in the third term, the sum ranges over the macro-label set Y, which grows exponentially in the sequence length. Therefore, this view suffers from the large cardinality of Y. In order to re-establish tractability of this formulation, we use a trick similar to the one deployed in (Taskar et al., 2004) and reparameterize the objective in terms of an equivalent lower dimensional set of parameters. The crucial observation is that the definition of k in (7) is homogeneous (or stationary): the absolute positions of patterns and labels in the sequence are irrelevant. This observation can be exploited by re-arranging the sums inside the kernel function with the outer sums, i.e. the sums in the objective function.

3.3. Exploiting Kernel Structure

In order to carry out this reparameterization more formally, we proceed in two steps. The first step consists of finding an appropriate low-dimensional summary of α. In particular, we are looking for a parameterization that does not scale with m = r^l. The second step consists of re-writing the objective function in terms of these new parameters.

As we will prove subsequently, the following linear map Λ extracts the information in α that is relevant for solving (8):

\gamma \equiv \Lambda \alpha, \qquad \Lambda \in \{0, 1\}^{n \cdot l \cdot r^2 \times n \cdot m}        (9)

where

\lambda_{(j,t,\sigma,\tau),(i,y)} \equiv \delta_{ij} \, [[y^t = \sigma \wedge y^{t+1} = \tau]]        (10)

Notice that each variable \lambda_{(j,t,\sigma,\tau),(i,y)} encodes whether the input sequence is the j-th training sequence and whether the label sequence y contains micro-labels σ and τ at positions t and t + 1, respectively. Hence, γ_{(j,t,σ,τ)} is simply the sum of all α_{(j,y)} over label sequences y that contain the στ-motif at position t.

We define two reductions derived from γ via further linear dimension reduction,

\gamma^{(1)} \equiv P \gamma, \quad \text{with } P_{(i,s,\sigma),(j,t,\tau,\rho)} = \delta_{ij} \delta_{st} \delta_{\sigma\tau},        (11a)
\gamma^{(2)} \equiv Q \gamma, \quad \text{with } Q_{(i,\sigma,\zeta),(j,t,\tau,\rho)} = \delta_{ij} \delta_{\sigma\tau} \delta_{\zeta\rho}.        (11b)

Intuitively, γ^{(2)}_{(i,σ,τ)} is the sum of all α_{(i,y)} over every position at which the sequence y contains the στ-motif, while γ^{(1)}_{(i,s,σ)} is the sum of all α_{(i,y)} for which the macro-label y has micro-label σ at position s.

We can now show how to represent the kernel matrix using the previously defined matrices Λ, P, Q and the gram matrix G with G_{(i,s),(j,t)} = g(x_i^s, x_j^t).

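In array terms, the reductions of Eq. (11) are simple marginalizations of γ over one index. The following is a small numpy sketch (our own illustration), storing γ as an (n, l, r, r) array over sequences, positions, and micro-label pairs:

```python
import numpy as np

def gamma_reductions(gamma, n, l, r):
    """Compute gamma^(1) and gamma^(2) of Eq. (11) from gamma.

    gamma : array of shape (n*l*r*r,) or (n, l, r, r), indexed by (i, t, sigma, tau)
    Returns gamma1 of shape (n, l, r) and gamma2 of shape (n, r, r).
    """
    g = np.asarray(gamma).reshape(n, l, r, r)
    gamma1 = g.sum(axis=3)   # sum over tau: mass with micro-label sigma at position s
    gamma2 = g.sum(axis=1)   # sum over positions: mass containing the sigma-tau motif
    return gamma1, gamma2
```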
Proposition 1. With the definitions from above,

K = \Lambda^T K' \Lambda, \qquad K' \equiv P^T H P + Q^T Q

where H = diag(G, . . . , G).

Proof. By elementary comparison of coefficients.

We now have r^2 parameters for every observation x^s in the training data (nlr^2 parameters in total) and we can rewrite the objective function in terms of these variables:

R(\gamma|X, y) = \gamma^T K' \gamma - \sum_{i=1}^{n} \gamma^T K' \Lambda e_{(i,y_i)} + \sum_{i=1}^{n} \log \sum_{y \in Y} \exp\big(\gamma^T K' \Lambda e_{(i,y)}\big)        (12)

3.4. GPSC and Other Label Sequence Learning Methods

We now briefly point out the relationship between our approach and previous discriminative methods for sequence learning, in particular CRFs, Hidden Markov SVMs (HM-SVMs) and Max-Margin Markov Networks (MMMs).

The CRF is a natural generalization of logistic regression to label sequence learning. The probability distribution over label sequences given an observation sequence is given by Eq. (1), where u(X, y) = \langle \theta, \Psi(X, y) \rangle is a linear discriminative function over some feature representation Ψ, parameterized by θ. The objective function of CRFs is the minimization of the negative conditional likelihood of the training data. To avoid overfitting, it is common to multiply the conditional likelihood by a Gaussian with zero mean and diagonal covariance matrix K, resulting in an additive term on the log scale:

\log p(\theta|X, y) = - \sum_{i=1}^{n} \log p(y_i|X_i, \theta) + \theta^T K \theta        (13)

From a Bayesian point of view, CRFs assume a uniform prior p(u) if there is no regularization term. When regularized, CRFs define a Gaussian distribution over a finite-dimensional vector space of parameters θ. In GPSC, on the other hand, the prior is defined as a Gaussian distribution over a function space of possibly infinite dimension. Thus, GPSC generalizes CRFs by defining a more sophisticated prior on the discriminative function u. This prior leads to the ability to use kernel functions in order to construct and learn over Reproducing Kernel Hilbert Spaces. So GPSC, a non-parametric tool for sequence labeling, can overcome the limitations of CRFs, which are parametric (linear) statistical models. When the kernel that defines the covariance matrix K in GPSC is linear, u in both models becomes equivalent.

The difference between the SVM and GP approaches to sequence learning lies in the loss function used over the training data, i.e. hinge loss vs. log loss. The GPSC objective function parameterized by α (Eq. (8)) corresponds to HM-SVMs, where the number of parameters scales exponentially with the length of the sequences. The objective function parameterized by γ (Eq. (12)) corresponds to MMMs, where the number of parameters scales only linearly.

4. GPSC Optimization Algorithm

4.1. A Dense Algorithm

Using the optimization methods described in Section 2 requires the computation of the Hessian matrix. In sequence labeling, this corresponds to computing the expectations of micro-labels within different cliques, which is not tractable to compute exactly for large training sets. In order to minimize R with respect to γ, we propose a 1st order exact optimization method, which we call Dense Gaussian Process Sequence Classification (DGPS).

It is well-known that the derivative of the log partition function with respect to γ is simply the expectation of the sufficient statistics:

\nabla_\gamma \log \Big[ \sum_{y \in Y} \exp\big(\gamma^T K' \Lambda e_{(i,y)}\big) \Big] = E_Y \big[ \Lambda e_{(i,Y)} \big]        (14)

where E_Y denotes an expectation with respect to the conditional distribution of the label sequence y given the observation sequence X_i. Then, the gradient of R is given by:

\nabla_\gamma R = 2 K' \gamma - \sum_{i=1}^{n} K' \Lambda e_{(i,y_i)} + \sum_{i=1}^{n} K' E_Y \big[ \Lambda e_{(i,Y)} \big]        (15)

The remaining challenge is to come up with an efficient way to compute the expectations. First of all, let us examine these quantities more explicitly:

E_Y \big[ (\Lambda e_{(i,Y)})_{(j,t,\sigma,\tau)} \big] = \delta_{ij} \, E_Y \big[ [[Y^t = \sigma \wedge Y^{t+1} = \tau]] \big]        (16)

In order to compute the above expectations, one can once again exploit the structure of the kernel and is left with the problem of computing probabilities for every neighboring micro-label pair (σ, τ) at positions (t, t + 1) for all training sequences X_i. The latter can be accomplished by performing the forward-backward algorithm over the training data using the transition probability matrix T and the observation probability matrices O^{(i)}, which are simply decompositions and reshapings of K':

\bar{\gamma}^{(2)} \equiv R \gamma^{(2)}, \quad \text{with } R_{(\sigma,\zeta),(i,\tau,\rho)} = \delta_{\sigma\tau} \delta_{\zeta\rho}        (17a)
T \equiv [\bar{\gamma}^{(2)}]_{r,r}        (17b)
O^{(i)} = [\gamma^{(1)}]_{n \cdot l, r} \, G_{(i,.),(.,.)}        (17c)

where [x]_{m,n} denotes the reshaping of a vector x into an m × n matrix, A_{I,J} denotes the |I| × |J| sub-matrix of A, and (.) denotes the set of all possible indices.
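The pairwise expectations in Eq. (16) can be computed with a standard forward-backward pass once the scores have been decomposed into transition and observation parts as in Eq. (17). The sketch below is a generic numpy implementation of that pass over non-negative potentials; how exactly T and O^(i) are obtained from K' and γ follows Eq. (17) and is not repeated here.

```python
import numpy as np

def pairwise_marginals(O, T):
    """p(Y^t = sigma, Y^{t+1} = tau | X) for all t via forward-backward.

    O : (L, r) non-negative observation potentials per position and micro-label
    T : (r, r) non-negative transition potentials between neighboring micro-labels
    Returns an (L-1, r, r) array of pairwise marginals.
    """
    L, r = O.shape
    alpha = np.zeros((L, r))
    beta = np.zeros((L, r))
    alpha[0] = O[0] / O[0].sum()
    for t in range(1, L):                    # forward pass with rescaling
        alpha[t] = O[t] * (alpha[t - 1] @ T)
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(L - 2, -1, -1):           # backward pass with rescaling
        beta[t] = T @ (O[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    xi = np.empty((L - 1, r, r))
    for t in range(L - 1):
        # unnormalized pairwise marginal: alpha_t(s) T(s,t') O_{t+1}(t') beta_{t+1}(t')
        xi[t] = alpha[t][:, None] * T * (O[t + 1] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return xi
```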

A single optimization step of DGPS is described in Algorithm 1 below. The complexity of one optimization step is O(t^2), dominated by the forward-backward algorithm over all instances, where t = nlr^2. We propose to use a quasi-Newton method for the optimization process. The overall complexity is then O(ηt^2), where η < t^2. The memory requirement is given by the size of γ, O(t).

Algorithm 1: One optimization step of Dense Gaussian Process Sequence Classification (DGPS)
Require: Training data (X_i, y_i)_{i=1:n}; proposed parameter values γ_c
1: Initialize γ_c^(1), γ_c^(2) (Eq. (11)).
2: Compute T w.r.t. γ_c^(2) (Eq. (17a), Eq. (17b)).
3: for i = 1, . . . , n do
4:   Compute O^(i) w.r.t. γ_c^(1) (Eq. (17c)).
5:   Compute p(y_i|X_i, γ_c) and E_Y[[[Y^t = σ ∧ Y^{t+1} = τ]]] for all t, σ, τ via the forward-backward algorithm using O^(i) and T.
6: end for
7: Compute ∇_γ R (Eq. (15)).

During inference, one can find the most likely label sequence for an observation sequence X by performing Viterbi decoding using the transition and observation probability matrices described above (see the sketch at the end of this section).

4.2. A Sparse Algorithm

While the above method is attractive for small data sets, the computation and storage of K' pose a serious problem when the data set is large. Also, classification of a new observation involves evaluating the covariance function at nl data points, which is more than what is acceptable for many applications. Hence, as in the case of standard Gaussian Process classification discussed in Section 2, one has to find a method for sparse solutions in terms of the γ parameters in order to speed up the training and prediction stages.

We propose a sparse greedy method, Sparse Gaussian Process Sequence Classification (SGPS), that is similar to the method presented by Bennett et al. (2002). SGPS starts with an empty matrix K̂. At each iteration, SGPS selects a training instance X_i and computes the gradients of the parameters associated with X_i, namely γ_(i,.), in order to select the steepest descent direction(s) of R over this subspace. Then K̂ is augmented with these columns and SGPS optimizes the current problem using a quasi-Newton method. This process is repeated until the gradients vanish (i.e. they are smaller than a threshold value η) or a maximum number p of γ coordinates has been selected (i.e. some sparseness level is achieved). Since the bottleneck of this method is the computation of the expectations E_Y[[[Y^t = σ ∧ Y^{t+1} = τ]]], we pick the d steepest directions once the expectations are computed.

One has two options for computing the optimal γ at every iteration: updating all of the γ parameters selected so far, or alternatively, updating only the parameters selected in the last iteration. We prefer the latter because of its less expensive iterations. This approach is in the spirit of a boosting algorithm or the cyclic coordinate optimization method.

SGPS is described in Algorithm 2 below. Its complexity is O(p^2 t), where p is the maximum number of coordinates allowed.

Algorithm 2: Sparse Gaussian Process Sequence Classification (SGPS)
Require: Training data (X_i, y_i)_{i=1:n}; maximum number of coordinates to be selected, p, with p < nlr^2; threshold value η for the gradients
1: K̂ ← [ ]
2: for i = 1, . . . , n do
3:   Compute ∇_{γ_(i,.)} R (Eq. (15)).
4:   s ← steepest d directions of ∇_{γ_(i,.)} R
5:   K̂ ← [K̂; K e_s]
6:   Optimize R w.r.t. s.
7:   Return if ∇_γ R < η or p coordinates have been selected.
8: end for
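For the inference step mentioned in Section 4.1, a generic Viterbi decoder over the same kind of transition and observation potentials could look as follows; this is a standard dynamic-programming sketch, not code from the paper.

```python
import numpy as np

def viterbi(O, T):
    """Most likely micro-label sequence under observation potentials O ((L, r))
    and transition potentials T ((r, r)), both strictly positive."""
    L, r = O.shape
    logO, logT = np.log(O), np.log(T)
    score = np.zeros((L, r))
    back = np.zeros((L, r), dtype=int)
    score[0] = logO[0]
    for t in range(1, L):
        cand = score[t - 1][:, None] + logT   # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + logO[t]
    path = [int(score[-1].argmax())]
    for t in range(L - 1, 0, -1):             # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```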


5. Experiments

5.1. Pitch Accent Prediction

Pitch accent prediction is the task of identifying the more prominent words in a sentence. The micro-label set is of size 2: accented and not-accented. We used the phonetically hand-transcribed Switchboard corpus, consisting of 1824 sentences (13K words) (Greenberg et al., 1996). We extracted probabilistic, acoustic and textual information from the current, previous and next words for every position in the training data. We used 1st order Markov features to capture the dependencies between neighboring labels.

We compared the performance of CRFs and HM-SVMs with the GPSC dense and sparse methods according to their test accuracy in 5-fold cross validation. CRFs were regularized and optimized using limited-memory BFGS, a limited-memory quasi-Newton optimization method. For the experiments on DGPS, we used polynomial kernels of different degrees (denoted DGPSX in Figure 1a, where X ∈ {1, 2, 3} is the degree of the polynomial kernel). We used a third order polynomial kernel in HM-SVMs (denoted SVM3 in Figure 1).

Figure 1. Pitch accent prediction results. a) Test accuracy of pitch accent prediction over a window of size 3 using 5-fold cross validation. b) Test accuracy of pitch accent prediction w.r.t. the sparseness of the solution. c) Precision-recall curves for different threshold probabilities to abstain.

As expected, CRFs and DGPS1 performed very similarly. When 2nd order features were incorporated implicitly using a second degree polynomial kernel (DGPS2), the performance increased dramatically. Extracting 2nd order features explicitly results in a 12 million dimensional feature space, where CRFs slow down dramatically. We observed that 3rd order features do not provide significant improvement over DGPS2. HM-SVM3 performs slightly worse than DGPS2.

To investigate how the sparsity of SGPS affects its performance, we report the test accuracy with respect to the sparseness of the SGPS solution in Figure 1b. Sparseness is measured by the percentage of the parameters selected by SGPS. The straight line is the performance of DGPS using a second degree polynomial kernel. Using 1% of the parameters, SGPS achieves 75% accuracy (1.48% less than the accuracy of DGPS). When 7.8% of the parameters are selected, the accuracy is 76.18%, which is not significantly different from the performance of DGPS (76.48%). We observed that these parameters were related to 6.2% of the observations, along with 1.13 label pairs on average. Thus, during inference one needs to evaluate the kernel function at only 6% of the observations, which reduces the inference time dramatically.

In order to experimentally verify how useful the predictive probabilities are as confidence scores, we forced DGPS to abstain from predicting a label when the probability of a micro-label is lower than a threshold value. In Figure 1c, we plot precision-recall values for different thresholds. We observed that the error rate for DGPS decreased by 8.54% when abstaining on 14.93% of the test data. The improvement in the error rate shows the validity of the probabilities generated by DGPS.

5.2. Named Entity Recognition

Named Entity Recognition (NER), a subtask of Information Extraction, is the task of finding phrases containing names in sentences. The micro-label set consists of the beginning and continuation of person, location, organization and miscellaneous names, as well as non-name. We used a Spanish newswire corpus, which was provided for the Special Session of CoNLL 2002 on NER, from which we randomly selected 1000 sentences (21K words). We used the word and its spelling properties for the current, previous and next observations.

The experimental setup was similar to the pitch accent prediction task. We compared the performance of CRFs with and without the regularizer term (CRF-R, CRF) with the GPSC dense and sparse methods. Qualitatively, the behavior of the different optimization methods is comparable to the pitch accent prediction task. The results are summarized in Table 1. Second degree polynomial DGPS outperformed the other methods. We set the sparseness parameter of SGPS to 25%, i.e. p = 0.25 nlr^2, where r = 9 and nl = 21K on average. SGPS with 25% sparseness achieves an accuracy that is only 0.1% below DGPS. We observed that 19% of the observations are selected, along with 1.32 label pairs on average, which means that one needs to compute only one fifth of the gram matrix.

Table 1. Test error of NER over a window of size 3 using 5-fold cross validation.

          DGPS1   DGPS2   SGPS2   CRF    CRF-B
  Error   4.58    4.39    4.48    4.92   4.56

We also tried a sparse algorithm that does not exploit the kernel structure and optimizes Equation 8 to obtain sparse solutions in terms of observation sequences X and label sequences y, as opposed to SGPS, where the sparse solution is in terms of observations and label pairs. This method achieved 92.7% accuracy and hence was clearly outperformed by all the other methods.

6. Conclusion and Future Work

We presented GPSC, a generalization of Gaussian Process classification to the label sequence learning problem. This method combines the rigorous probabilistic semantics of CRFs with the ability to overcome the curse of dimensionality by using kernels to construct and learn over an RKHS. The experiments on named entity recognition show the competitiveness of our approach, and the experiments on pitch accent prediction show its superiority in terms of the achieved error rate. We also experimentally verified the usefulness of the probabilities obtained from GPSC.

Acknowledgments

This work was supported by NSF-ITR grants IIS-0312401 and IIS-0085940. Thanks to Michelle Gregory for providing us the pitch accent data and the valuable features.

References

Altun, Y., Hofmann, T., & Johnson, M. (2003a). Discriminative learning for label sequences via boosting. Advances in Neural Information Processing Systems.

Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003b). Hidden Markov support vector machines. 20th International Conference on Machine Learning.

Bennett, K., Momma, M., & Embrechts, J. (2002). MARK: A boosting algorithm for heterogeneous kernel models. Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining.

Collins, M. (2002). Discriminative training methods for Hidden Markov Models: Theory and experiments with perceptron algorithms. Empirical Methods in Natural Language Processing (EMNLP).

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2.

Csató, L., & Opper, M. (2002). Sparse on-line Gaussian Processes. Neural Computation, 14, 641–668.

Gibbs, M. N., & MacKay, D. J. C. (2000). Variational Gaussian Process classifiers. IEEE Transactions on Neural Networks, 11, 1458.

Girosi, F. (1997). An equivalence between sparse approximation and support vector machines (Technical Report AIM-1606).

Greenberg, S., Ellis, D., & Hollenback, J. (1996). Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. ICSLP96.

Jaakkola, T. S., & Jordan, M. I. (1996). Computing upper and lower bounds on likelihoods in intractable networks. Proc. of the Twelfth Conf. on UAI.

Kimeldorf, G., & Wahba, G. (1971). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41(2), 495–502.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning.

McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum Entropy Markov Models for information extraction and segmentation. Proc. 17th International Conf. on Machine Learning (ICML 2000).

Minka, T. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab.

Opper, M., & Winther, O. (2000). Gaussian Processes for classification: Mean-field algorithms. Neural Computation, 12, 2655–2684.

Punyakanok, V., & Roth, D. (2000). The use of classifiers in sequential inference. Advances in Neural Information Processing Systems.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. MIT Press.

Seeger, M., Lawrence, N. D., & Herbrich, R. (2003). Fast sparse Gaussian Process methods: The informative vector machine. Advances in Neural Information Processing Systems.

Smola, A. J., & Bartlett, P. L. (2000). Sparse greedy Gaussian Process regression. Advances in Neural Information Processing Systems.

Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proc. 17th International Conf. on Machine Learning.

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. Advances in Neural Information Processing Systems.

Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the European Symposium on Artificial Neural Networks.

Williams, C. K. I., & Barber, D. (1998). Bayesian classification with Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342–1351.

Williams, C. K. I., & Seeger, M. (2000). Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems.

Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. Advances in Neural Information Processing Systems.