
Gaussian Process Classification for Segmenting and Annotating Sequences

Yasemin Altun, [email protected]
Department of Computer Science, Brown University, Providence, RI 02912, USA

Thomas Hofmann, [email protected]
Department of Computer Science, Brown University, Providence, RI 02912, USA
Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany

Alexander J. Smola, [email protected]
Group, RSISE, Australian National University, Canberra, ACT 0200, Australia

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.

Abstract

Many real-world classification tasks involve the prediction of multiple, inter-dependent class labels. A prototypical case of this sort deals with the prediction of a sequence of labels for a sequence of observations. Such problems arise naturally in the context of annotating and segmenting observation sequences. This paper generalizes Gaussian Process classification to predict multiple labels by taking dependencies between neighboring labels into account. Our approach is motivated by the desire to retain rigorous probabilistic semantics, while overcoming limitations of parametric methods like Conditional Random Fields, which exhibit conceptual and computational difficulties in high-dimensional input spaces. Experiments on named entity recognition and pitch accent prediction tasks demonstrate the competitiveness of our approach.

1. Introduction

Multiclass classification refers to the problem of assigning class labels to instances, where labels belong to some finite set of elements. Often, however, the instances to be labeled do not occur in isolation, but rather in observation sequences. One is then interested in predicting the joint label configuration, i.e. the sequence of labels corresponding to a sequence of observations, using models that take possible interdependencies between label variables into account. This scenario subsumes problems of sequence segmentation and annotation, which are ubiquitous in areas such as natural language processing, speech recognition, and computational biology.

The most common approach to sequence labeling is based on Hidden Markov Models (HMMs), which define a generative probabilistic model for labeled observation sequences. In recent years, the state of the art in sequence learning has been Conditional Random Fields (CRFs), introduced by Lafferty et al. (2001). In the most general terms, CRFs define a conditional model over label sequences given an observation sequence in terms of an exponential family; they are thus a natural generalization of logistic regression to the problem of label sequence prediction. Other related work on this subject includes Maximum Entropy Markov Models (McCallum et al., 2000) and the Markovian model of Punyakanok and Roth (2000). There have also been attempts to extend other discriminative methods such as AdaBoost (Altun et al., 2003a), perceptron learning (Collins, 2002), and Support Vector Machines (SVMs) (Altun et al., 2003b; Taskar et al., 2004) to the label sequence learning problem. The latter have compared favorably to other discriminative methods, including CRFs, in experiments. Moreover, they have the conceptual advantage of being compatible with implicit data representations via kernel functions.
In this paper, we investigate the use of Gaussian Process (GP) classification (Gibbs & MacKay, 2000; Williams & Barber, 1998) for label sequences. The main motivation for pursuing this direction is to combine the best of both worlds from CRFs and SVMs. More specifically, we would like to preserve the main strength of CRFs, which we see in their rigorous probabilistic semantics. There are two important advantages of a probabilistic model. First, it is very intuitive to incorporate prior knowledge within a probabilistic framework. Second, in addition to predicting the best labels, one can compute posterior label probabilities and thus derive confidence scores for predictions. This is a valuable property, in particular for applications requiring a cascaded architecture of classifiers: confidence scores can be propagated to subsequent processing stages or used to abstain on certain predictions. The other design goal is the ability to use kernel functions in order to construct and learn in Reproducing Kernel Hilbert Spaces (RKHS), thereby overcoming the limitations of (finite-dimensional) parametric statistical models.

A second, independent objective of our work is to gain clarification with respect to two aspects on which CRFs and the SVM-based methods differ: the first aspect is the loss function (logistic loss vs. hinge loss), and the second is the mechanism used for constructing the hypothesis space (parametric vs. RKHS).

GPs are non-parametric tools for performing Bayesian inference which, like SVMs, make use of the kernel trick to work in high (possibly infinite) dimensional spaces. Like other discriminative methods, GPs predict single variables and do not take into account any dependency structure in the case of multiple label predictions. Our goal is to generalize GPs to predict label sequences. While computationally demanding, recent progress on sparse approximation methods for GPs, e.g. (Csató & Opper, 2002; Smola & Bartlett, 2000; Seeger et al., 2003; Zhu & Hastie, 2001), suggests that scalable GP label sequence learning may be an achievable goal. Exploiting the compositionality of the kernel function, we derive a gradient-based optimization method for GP sequence classification. Moreover, we present a column generation algorithm that performs a sparse approximation of the solution.

The rest of the paper is organized as follows: In Section 2, we introduce Gaussian Process classification. We then present our formulation of Gaussian Process sequence classification (GPSC) in Section 3 and describe the proposed optimization algorithms in Section 4. Finally, we report experimental results on real-world data for named entity classification and pitch accent prediction in Section 5.
2. Gaussian Process Classification

In supervised classification, we are given a training set of n labeled instances or observations (x_i, y_i) with y_i ∈ {1, . . . , m}, drawn i.i.d. from an unknown, but fixed, joint probability distribution p(x, y). We denote the training observations and labels by X = (x_1, . . . , x_n) and y = (y_1, . . . , y_n), respectively.

GP classification constructs a two-stage model for the conditional probability distribution p(y|x) by introducing an intermediate, unobserved stochastic process u ≡ (u(x, y)), where u(x, y) can be considered a compatibility measure of an observation x and a label y. Given an instantiation of the stochastic process, we assume that the conditional probability p(y|x, u) only depends on the values of u at the input x via a multinomial response model, i.e.

p(y|x, u) = p(y|u(x, \cdot)) = \frac{\exp(u(x, y))}{\sum_{y'=1}^{m} \exp(u(x, y'))}        (1)

It is furthermore assumed that the stochastic process u is a zero mean Gaussian process with covariance function C, typically a kernel function. An additional assumption typically made in multiclass GP classification is that the processes u(·, y) and u(·, y') are uncorrelated for y ≠ y' (Williams & Barber, 1998).

For notational convenience, we will identify u with the relevant restriction of u to the training patterns X and represent it as an n × m matrix. For simplicity we will (in slight abuse of notation) also think of u as a vector with multi-index (i, y). Here and below, we make extensive use of multi-indices: we put parentheses around a comma-separated list of indices to denote a multi-index, and use two comma-separated multi-indices to refer to matrix elements. Moreover, we will denote by K the kernel matrix with entries K_{(i,y),(j,y')} = C((x_i, y), (x_j, y')). Notice that under the above assumptions K has a block diagonal structure with blocks K(y) = (K_{ij}(y)), K_{ij}(y) ≡ C_y(x_i, x_j), where C_y is a class-specific covariance function.

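As an illustration of this setup, the following minimal numpy sketch (our own, not from the paper) assembles a block-diagonal kernel matrix from class-specific gram matrices and evaluates the multinomial response model of Eq. (1). The (i, y) multi-index is laid out with y varying fastest; this ordering is an arbitrary convention chosen for the example.

```python
import numpy as np

def block_diag_kernel(gram_per_class):
    """Assemble the block-diagonal kernel matrix K of Section 2 from
    class-specific gram matrices K(y)_ij = C_y(x_i, x_j).

    gram_per_class : list of m arrays, each of shape (n, n)
    Returns an (n*m, n*m) matrix indexed by the multi-index (i, y), y fastest.
    """
    n, m = gram_per_class[0].shape[0], len(gram_per_class)
    K = np.zeros((n * m, n * m))
    for y, G in enumerate(gram_per_class):
        idx = np.arange(n) * m + y          # positions of the (., y) entries
        K[np.ix_(idx, idx)] = G
    return K

def multinomial_response(u_row):
    """p(y | x, u) of Eq. (1): a softmax over the values u(x, .)."""
    z = np.exp(u_row - u_row.max())         # subtract max for numerical stability
    return z / z.sum()
```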

Following a Bayesian approach, the prediction of a label for a new observation x is obtained by computing the posterior probability distribution over labels and selecting the label that has the highest probability:

p(y|X, y, x) = \int p(y|u(x, \cdot)) \, p(u|X, y) \, du        (2)

Thus, one needs to integrate out all n · m latent variables of u. Since this is in general intractable, it is common to perform a saddle-point approximation of the integral around the optimal point estimate, which is the maximum a posteriori (MAP) estimate: p(y|X, y, x) ≈ p(y|u^{map}(x, ·)), where u^{map} = argmax_u log p(u|X, y). Exploiting the conditional independence assumptions, the posterior of u can, up to a multiplicative constant, be written as

p(u|X, y) \propto p(u) \prod_{i=1}^{n} p(y_i|u(x_i, \cdot))        (3)

Combining the GP prior over u and the conditional model in (1) yields the more specific expression

\log p(u|X, y) = \sum_{i=1}^{n} \Big[ u(x_i, y_i) - \log \sum_{y} \exp(u(x_i, y)) \Big] - \frac{1}{2} u^T K^{-1} u + \text{const.}        (4)

The Representer Theorem (Kimeldorf & Wahba, 1971) guarantees that the maximizer of (4) is of the form

u^{map}(x_i, y) = \sum_{j=1}^{n} \sum_{y'=1}^{m} \alpha_{(j,y')} K_{(i,y),(j,y')}        (5)

with suitably chosen coefficients α. In the block diagonal case, K_{(i,y),(j,y')} = 0 for y ≠ y' and this reduces to the simpler form u^{map}(x_i, y) = \sum_{j=1}^{n} \alpha_{(j,y)} C_y(x_i, x_j).

Using the representation in (5), we can rewrite the optimization problem as an objective R parameterized by α. Let e_{(i,y)} be the (i, y)-th unit vector; then \alpha^T K e_{(i,y)} = \sum_{j,y'} \alpha_{(j,y')} K_{(i,y),(j,y')} and the negative of Eq. (4) can be written as follows:

R(\alpha|X, y) = \alpha^T K \alpha - \sum_{i=1}^{n} \log p(y_i|x_i, \alpha)        (6)
              = \alpha^T K \alpha - \sum_{i=1}^{n} \alpha^T K e_{(i,y_i)} + \sum_{i=1}^{n} \log \sum_{y} \exp\big(\alpha^T K e_{(i,y)}\big)

A comparison between (6) and a similar multiclass SVM formulation (Crammer & Singer, 2001; Weston & Watkins, 1999) clarifies the connection between GP classification and SVMs. Their difference lies primarily in the utilized loss functions: logistic loss vs. hinge loss. Because the hinge loss truncates values smaller than a threshold to 0, it enforces sparseness in terms of the α parameters. This is not the case for logistic regression as well as for other choices of loss functions; several studies have therefore focused on finding sparse solutions of Eq. (6) or of similar optimization problems (Bennett et al., 2002; Girosi, 1997; Smola & Schölkopf, 2000; Zhu & Hastie, 2001).

For non-linear link functions like the one induced by Eq. (1), u^{map} cannot be found analytically and one has to resort to approximate solutions. Various approximation schemes have been studied to that end: Laplace approximation (Williams & Barber, 1998; Williams & Seeger, 2000; Zhu & Hastie, 2001), variational methods (Jaakkola & Jordan, 1996), mean field approximations (Opper & Winther, 2000), and expectation propagation (Minka, 2001; Seeger et al., 2003). Performing these methods usually involves the computation of the Hessian matrix as well as the inversion of K, an nm × nm matrix, which is not tractable for large data sets (of size n) and/or large label sets (of size m). Several techniques have been proposed to approximate K such that the inversion of the approximating matrix is tractable (cf. (Schölkopf & Smola, 2002) for references on such methods). One can also try to solve (6) using greedy optimization methods as proposed in (Bennett et al., 2002).

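For concreteness, here is a small numpy sketch of the objective R(α) in Eq. (6), written under the same (i, y) layout as the sketch above. It is a toy illustration of the formula, not the authors' implementation; a quasi-Newton or other gradient-based optimizer would be applied to a function like this.

```python
import numpy as np

def gp_multiclass_objective(alpha, K, y):
    """Negative log posterior R(alpha | X, y) of Eq. (6), up to constants.

    alpha : (n*m,) coefficient vector with multi-index (i, y), y fastest
    K     : (n*m, n*m) kernel matrix with entries K_{(i,y),(j,y')}
    y     : (n,) integer labels in {0, ..., m-1}
    """
    n = y.shape[0]
    m = K.shape[0] // n
    u = (K @ alpha).reshape(n, m)            # u[i, y'] = alpha^T K e_{(i,y')}
    quad = alpha @ K @ alpha                 # alpha^T K alpha
    u_max = u.max(axis=1)
    log_Z = u_max + np.log(np.exp(u - u_max[:, None]).sum(axis=1))
    return quad - u[np.arange(n), y].sum() + log_Z.sum()
```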

3. GP Sequence Classification (GPSC)

3.1. Sequence Labeling and GPC

In sequence classification, our goal is to learn a discriminant function for sequences, i.e. a mapping from observation sequences X = (x^1, x^2, . . . , x^t, . . . , x^T) to label sequences y = (y^1, y^2, . . . , y^t, . . . , y^T). There exists a label y^t ∈ Σ = {1, . . . , r} for every observation x^t in the sequence. Thus, we have T multiclass classification problems. Because of the sequence structure of the labels (i.e. every label y^t depends on its neighboring labels), one needs to solve these T classification problems jointly. The problem can then be considered as multiclass classification where, for an observation sequence of length l, the possible label set Y is of size m = r^l (for notational convenience we assume that all training sequences are of the same length l). We call Σ the label set of observations, or micro-label set, and Y the set of label sequences, or macro-label set.

We assume that a training set of n labeled sequences Z ≡ {(X_i, y_i) | i = 1, . . . , n} is available. Using the notation introduced in the context of GP classification, we define p(y_i|u(X_i)) as in (1), treating every macro-label as a separate label in GP multiclass classification and using the whole sequence X_i as the input.

3.2. Kernels for Labeled Sequences

The fundamental design decision is then the engineering of the kernel function k that determines the kernel matrix K. Notice that the use of a block diagonal kernel matrix is not an option in the current setting, since it would prohibit generalizing across label sequences that differ in as little as a single micro-label.

We define the kernel function for labeled sequences with respect to the feature representation. Inspired by HMMs, we use two types of features: features that capture the dependency of the micro-labels on the attributes Φ(x^s) of the observations, and features that capture the inter-dependency of micro-labels. As in other discriminative methods, Φ(x^s) can include overlapping attributes of x^s as well as attributes of observations x^t with t ≠ s. Using stationarity, the inner product between the feature vectors of two observation sequences can be stated as k = k^1 + k^2, where

k^1((X, y), (\bar{X}, \bar{y})) \equiv \sum_{s,t} [[y^s = \bar{y}^t]] \, \langle \Phi(x^s), \Phi(\bar{x}^t) \rangle        (7a)
k^2((X, y), (\bar{X}, \bar{y})) \equiv \sum_{s,t} [[y^s = \bar{y}^t \wedge y^{s+1} = \bar{y}^{t+1}]]        (7b)

k^1 couples observations in both sequences that are classified with the same micro-labels at the respective positions. k^2 simply counts the number of consecutive label pairs both label sequences have in common (irrespective of the inputs). One can generalize (7) in various ways, e.g. by using higher order terms between micro-labels in both contributions, without posing major conceptual challenges.

k is a linear kernel function for labeled sequences. It can be generalized to non-linear kernel functions for labeled sequences by replacing \langle \Phi(x^s), \Phi(\bar{x}^t) \rangle with a standard kernel function defined over input patterns.

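A direct transcription of Eq. (7) into numpy might look as follows; the linear base kernel (a plain dot product between the feature vectors Φ(x^s)) is an assumption of this sketch and can be swapped for any standard kernel, as noted above.

```python
import numpy as np

def sequence_kernel(X, y, Xbar, ybar, base_kernel=np.dot):
    """k((X, y), (Xbar, ybar)) = k1 + k2 of Eq. (7) for two labeled sequences.

    X, Xbar : sequences of per-position feature vectors Phi(x^s)
    y, ybar : integer micro-label sequences of the same lengths as X, Xbar
    """
    # k1: inner products between positions carrying identical micro-labels
    k1 = sum(base_kernel(X[s], Xbar[t])
             for s in range(len(y)) for t in range(len(ybar))
             if y[s] == ybar[t])
    # k2: number of shared consecutive micro-label pairs, input-independent
    k2 = sum(1
             for s in range(len(y) - 1) for t in range(len(ybar) - 1)
             if y[s] == ybar[t] and y[s + 1] == ybar[t + 1])
    return k1 + k2
```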

We can naively follow the same line of argumentation as in the GPC case of Section 2, invoke the Representer Theorem, and ultimately arrive at the objective in (6). Since we need it for subsequent derivations, we restate the objective here:

R(\alpha|Z) = \alpha^T K \alpha - \sum_{i=1}^{n} \alpha^T K e_{(i,y_i)} + \sum_{i=1}^{n} \log \sum_{y \in Y} \exp\big(\alpha^T K e_{(i,y)}\big)        (8)

Notice that in the third term, the sum ranges over the macro-label set Y, which grows exponentially in the sequence length. Therefore, this view suffers from the large cardinality of Y. In order to re-establish tractability of this formulation, we use a trick similar to the one deployed in (Taskar et al., 2004) and reparameterize the objective in terms of an equivalent lower dimensional set of parameters. The crucial observation is that the definition of k in (7) is homogeneous (or stationary): the absolute positions of patterns and labels in the sequence are irrelevant. This observation can be exploited by re-arranging the sums inside the kernel function with the outer sums, i.e. the sums in the objective function.

3.3. Exploiting Kernel Structure

In order to carry out this reparameterization more formally, we proceed in two steps. The first step consists of finding an appropriate low-dimensional summary of α. In particular, we are looking for a parameterization that does not scale with m = r^l. The second step consists of re-writing the objective function in terms of these new parameters.

As we will prove subsequently, the following linear map Λ extracts the information in α that is relevant for solving (8):

\gamma \equiv \Lambda \alpha, \qquad \Lambda \in \{0, 1\}^{n \cdot l \cdot r^2 \times n \cdot m}        (9)

where

\lambda_{(j,t,\sigma,\tau),(i,y)} \equiv \delta_{ij} \, [[y^t = \sigma \wedge y^{t+1} = \tau]]        (10)

Notice that each variable \lambda_{(j,t,\sigma,\tau),(i,y)} encodes whether the input sequence is the j-th training sequence and whether the label sequence y contains micro-labels σ and τ at positions t and t + 1, respectively. Hence, γ_{(j,t,σ,τ)} is simply the sum of all α_{(j,y)} over label sequences y that contain the στ-motif at position t.

We define two reductions derived from γ via further linear dimension reduction,

\gamma^{(1)} \equiv P \gamma, \quad \text{with } P_{(i,s,\sigma),(j,t,\tau,\rho)} = \delta_{ij} \delta_{st} \delta_{\sigma\tau},        (11a)
\gamma^{(2)} \equiv Q \gamma, \quad \text{with } Q_{(i,\sigma,\zeta),(j,t,\tau,\rho)} = \delta_{ij} \delta_{\sigma\tau} \delta_{\zeta\rho}.        (11b)

Intuitively, γ^{(2)}_{(i,σ,τ)} is the sum of all α_{(i,y)} over every position at which the sequence y contains the στ-motif, while γ^{(1)}_{(i,s,σ)} is the sum of all α_{(i,y)} for which the macro-label y has micro-label σ at position s.

We can now show how to represent the kernel matrix using the previously defined matrices Λ, P, Q and the gram matrix G with G_{(i,s),(j,t)} = g(x_i^s, x_j^t).

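In array terms, the reductions of Eq. (11) are simple marginalizations of γ over one index. The following is a small numpy sketch (our own illustration), storing γ as an (n, l, r, r) array over sequences, positions, and micro-label pairs:

```python
import numpy as np

def gamma_reductions(gamma, n, l, r):
    """Compute gamma^(1) and gamma^(2) of Eq. (11) from gamma.

    gamma : array of shape (n*l*r*r,) or (n, l, r, r), indexed by (i, t, sigma, tau)
    Returns gamma1 of shape (n, l, r) and gamma2 of shape (n, r, r).
    """
    g = np.asarray(gamma).reshape(n, l, r, r)
    gamma1 = g.sum(axis=3)   # sum over tau: mass with micro-label sigma at position s
    gamma2 = g.sum(axis=1)   # sum over positions: mass containing the sigma-tau motif
    return gamma1, gamma2
```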
Proposition 1. With the definitions from above,

K = \Lambda^T K' \Lambda, \qquad K' \equiv P^T H P + Q^T Q

where H = diag(G, . . . , G).

Proof. By elementary comparison of coefficients.

We now have r^2 parameters for every observation x^s in the training data (nlr^2 parameters in total) and we can rewrite the objective function in terms of these variables:

R(\gamma|X, y) = \gamma^T K' \gamma - \sum_{i=1}^{n} \gamma^T K' \Lambda e_{(i,y_i)} + \sum_{i=1}^{n} \log \sum_{y \in Y} \exp\big(\gamma^T K' \Lambda e_{(i,y)}\big)        (12)

3.4. GPSC and Other Label Sequence Learning Methods

We now briefly point out the relationship between our approach and previous discriminative methods for sequence learning, in particular CRFs, Hidden Markov SVMs (HM-SVMs) and Max-Margin Markov Networks (MMMs).

The CRF is a natural generalization of logistic regression to label sequence learning. The probability distribution over label sequences given an observation sequence is given by Eq. (1), where u(X, y) = \langle \theta, \Psi(X, y) \rangle is a linear discriminative function over some feature representation Ψ, parameterized by θ. The objective function of CRFs is the minimization of the negative conditional likelihood of the training data. To avoid overfitting, it is common to multiply the conditional likelihood by a Gaussian with zero mean and diagonal covariance matrix K, resulting in an additive term on the log scale:

\log p(\theta|X, y) = - \sum_{i=1}^{n} \log p(y_i|X_i, \theta) + \theta^T K \theta        (13)

From a Bayesian point of view, CRFs assume a uniform prior p(u) if there is no regularization term. When regularized, CRFs define a Gaussian distribution over a finite-dimensional vector space of parameters θ. In GPSC, on the other hand, the prior is defined as a Gaussian distribution over a function space of possibly infinite dimension. Thus, GPSC generalizes CRFs by defining a more sophisticated prior on the discriminative function u. This prior leads to the ability to use kernel functions in order to construct and learn over Reproducing Kernel Hilbert Spaces. So GPSC, a non-parametric tool for sequence labeling, can overcome the limitations of CRFs, which are parametric (linear) statistical models. When the kernel that defines the covariance matrix K in GPSC is linear, u in both models becomes equivalent.

The difference between the SVM and GP approaches to sequence learning lies in the loss function used over the training data, i.e. hinge loss vs. log loss. The GPSC objective function parameterized by α (Eq. (8)) corresponds to HM-SVMs, where the number of parameters scales exponentially with the length of the sequences. The objective function parameterized by γ (Eq. (12)) corresponds to MMMs, where the number of parameters scales only linearly.

4. GPSC Optimization Algorithm

4.1. A Dense Algorithm

Using the optimization methods described in Section 2 requires the computation of the Hessian matrix. In sequence labeling, this corresponds to computing the expectations of micro-labels within different cliques, which is not tractable to compute exactly for large training sets. In order to minimize R with respect to γ, we propose a 1st order exact optimization method, which we call Dense Gaussian Process Sequence Classification (DGPS).

It is well-known that the derivative of the log partition function with respect to γ is simply the expectation of the sufficient statistics:

\nabla_\gamma \log \Big[ \sum_{y \in Y} \exp\big(\gamma^T K' \Lambda e_{(i,y)}\big) \Big] = E_Y \big[ \Lambda e_{(i,Y)} \big]        (14)

where E_Y denotes an expectation with respect to the conditional distribution of the label sequence y given the observation sequence X_i. Then, the gradient of R is given by:

\nabla_\gamma R = 2 K' \gamma - \sum_{i=1}^{n} K' \Lambda e_{(i,y_i)} + \sum_{i=1}^{n} K' E_Y \big[ \Lambda e_{(i,Y)} \big]        (15)

The remaining challenge is to come up with an efficient way to compute the expectations. First of all, let us examine these quantities more explicitly:

E_Y \big[ (\Lambda e_{(i,Y)})_{(j,t,\sigma,\tau)} \big] = \delta_{ij} \, E_Y \big[ [[Y^t = \sigma \wedge Y^{t+1} = \tau]] \big]        (16)

In order to compute the above expectations, one can once again exploit the structure of the kernel and is left with the problem of computing probabilities for every neighboring micro-label pair (σ, τ) at positions (t, t + 1) for all training sequences X_i. The latter can be accomplished by performing the forward-backward algorithm over the training data using the transition probability matrix T and the observation probability matrices O^{(i)}, which are simply decompositions and reshapings of K':

\bar{\gamma}^{(2)} \equiv R \gamma^{(2)}, \quad \text{with } R_{(\sigma,\zeta),(i,\tau,\rho)} = \delta_{\sigma\tau} \delta_{\zeta\rho}        (17a)
T \equiv [\bar{\gamma}^{(2)}]_{r,r}        (17b)
O^{(i)} = [\gamma^{(1)}]_{n \cdot l, r} \, G_{(i,.),(.,.)}        (17c)

where [x]_{m,n} denotes the reshaping of a vector x into an m × n matrix, A_{I,J} denotes the |I| × |J| sub-matrix of A, and (.) denotes the set of all possible indices.
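The pairwise expectations in Eq. (16) can be computed with a standard forward-backward pass once the scores have been decomposed into transition and observation parts as in Eq. (17). The sketch below is a generic numpy implementation of that pass over non-negative potentials; how exactly T and O^(i) are obtained from K' and γ follows Eq. (17) and is not repeated here.

```python
import numpy as np

def pairwise_marginals(O, T):
    """p(Y^t = sigma, Y^{t+1} = tau | X) for all t via forward-backward.

    O : (L, r) non-negative observation potentials per position and micro-label
    T : (r, r) non-negative transition potentials between neighboring micro-labels
    Returns an (L-1, r, r) array of pairwise marginals.
    """
    L, r = O.shape
    alpha = np.zeros((L, r))
    beta = np.zeros((L, r))
    alpha[0] = O[0] / O[0].sum()
    for t in range(1, L):                    # forward pass with rescaling
        alpha[t] = O[t] * (alpha[t - 1] @ T)
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(L - 2, -1, -1):           # backward pass with rescaling
        beta[t] = T @ (O[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    xi = np.empty((L - 1, r, r))
    for t in range(L - 1):
        # unnormalized pairwise marginal: alpha_t(s) T(s,t') O_{t+1}(t') beta_{t+1}(t')
        xi[t] = alpha[t][:, None] * T * (O[t + 1] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return xi
```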

A single optimization step of DGPS is described in Algorithm 1 below. The complexity of one optimization step is O(t^2), dominated by the forward-backward algorithm over all instances, where t = nlr^2. We propose to use a quasi-Newton method for the optimization process. The overall complexity is then O(ηt^2), where η < t^2. The memory requirement is given by the size of γ, O(t).

Algorithm 1: One optimization step of Dense Gaussian Process Sequence Classification (DGPS)
Require: Training data (X_i, y_i)_{i=1:n}; proposed parameter values γ_c
1: Initialize γ_c^(1), γ_c^(2) (Eq. (11)).
2: Compute T w.r.t. γ_c^(2) (Eq. (17a), Eq. (17b)).
3: for i = 1, . . . , n do
4:   Compute O^(i) w.r.t. γ_c^(1) (Eq. (17c)).
5:   Compute p(y_i|X_i, γ_c) and E_Y[[[Y^t = σ ∧ Y^{t+1} = τ]]] for all t, σ, τ via the forward-backward algorithm using O^(i) and T.
6: end for
7: Compute ∇_γ R (Eq. (15)).

During inference, one can find the most likely label sequence for an observation sequence X by performing Viterbi decoding using the transition and observation probability matrices described above (see the sketch at the end of this section).

4.2. A Sparse Algorithm

While the above method is attractive for small data sets, the computation and storage of K' pose a serious problem when the data set is large. Also, classification of a new observation involves evaluating the covariance function at nl data points, which is more than what is acceptable for many applications. Hence, as in the case of standard Gaussian Process classification discussed in Section 2, one has to find a method for sparse solutions in terms of the γ parameters in order to speed up the training and prediction stages.

We propose a sparse greedy method, Sparse Gaussian Process Sequence Classification (SGPS), that is similar to the method presented by Bennett et al. (2002). SGPS starts with an empty matrix K̂. At each iteration, SGPS selects a training instance X_i and computes the gradients of the parameters associated with X_i, namely γ_(i,.), in order to select the steepest descent direction(s) of R over this subspace. Then K̂ is augmented with these columns and SGPS optimizes the current problem using a quasi-Newton method. This process is repeated until the gradients vanish (i.e. they are smaller than a threshold value η) or a maximum number p of γ coordinates has been selected (i.e. some sparseness level is achieved). Since the bottleneck of this method is the computation of the expectations E_Y[[[Y^t = σ ∧ Y^{t+1} = τ]]], we pick the d steepest directions once the expectations are computed.

One has two options for computing the optimal γ at every iteration: updating all of the γ parameters selected so far, or alternatively, updating only the parameters selected in the last iteration. We prefer the latter because of its less expensive iterations. This approach is in the spirit of a boosting algorithm or the cyclic coordinate optimization method.

SGPS is described in Algorithm 2 below. Its complexity is O(p^2 t), where p is the maximum number of coordinates allowed.

Algorithm 2: Sparse Gaussian Process Sequence Classification (SGPS)
Require: Training data (X_i, y_i)_{i=1:n}; maximum number of coordinates to be selected, p, with p < nlr^2; threshold value η for the gradients
1: K̂ ← [ ]
2: for i = 1, . . . , n do
3:   Compute ∇_{γ_(i,.)} R (Eq. (15)).
4:   s ← steepest d directions of ∇_{γ_(i,.)} R
5:   K̂ ← [K̂; K e_s]
6:   Optimize R w.r.t. s.
7:   Return if ∇_γ R < η or p coordinates have been selected.
8: end for
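For the inference step mentioned in Section 4.1, a generic Viterbi decoder over the same kind of transition and observation potentials could look as follows; this is a standard dynamic-programming sketch, not code from the paper.

```python
import numpy as np

def viterbi(O, T):
    """Most likely micro-label sequence under observation potentials O ((L, r))
    and transition potentials T ((r, r)), both strictly positive."""
    L, r = O.shape
    logO, logT = np.log(O), np.log(T)
    score = np.zeros((L, r))
    back = np.zeros((L, r), dtype=int)
    score[0] = logO[0]
    for t in range(1, L):
        cand = score[t - 1][:, None] + logT   # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + logO[t]
    path = [int(score[-1].argmax())]
    for t in range(L - 1, 0, -1):             # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```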


5. Experiments

5.1. Pitch Accent Prediction

Pitch accent prediction is the task of identifying the more prominent words in a sentence. The micro-label set is of size 2: accented and not-accented. We used the phonetically hand-transcribed Switchboard corpus, consisting of 1824 sentences (13K words) (Greenberg et al., 1996). We extracted probabilistic, acoustic and textual information from the current, previous and next words for every position in the training data. We used 1st order Markov features to capture the dependencies between neighboring labels.

We compared the performance of CRFs and HM-SVMs with the GPSC dense and sparse methods according to their test accuracy in 5-fold cross validation. CRFs were regularized and optimized using limited-memory BFGS, a limited-memory quasi-Newton optimization method. For the experiments on DGPS, we used polynomial kernels of different degrees (denoted DGPSX in Figure 1a, where X ∈ {1, 2, 3} is the degree of the polynomial kernel). We used a third order polynomial kernel in HM-SVMs (denoted SVM3 in Figure 1).

Figure 1. Pitch accent prediction results. a) Test accuracy of pitch accent prediction over a window of size 3 using 5-fold cross validation. b) Test accuracy of pitch accent prediction w.r.t. the sparseness of the solution. c) Precision-recall curves for different threshold probabilities to abstain.

As expected, CRFs and DGPS1 performed very similarly. When 2nd order features were incorporated implicitly using a second degree polynomial kernel (DGPS2), the performance increased dramatically. Extracting 2nd order features explicitly results in a 12 million dimensional feature space, where CRFs slow down dramatically. We observed that 3rd order features do not provide significant improvement over DGPS2. HM-SVM3 performs slightly worse than DGPS2.

To investigate how the sparsity of SGPS affects its performance, we report the test accuracy with respect to the sparseness of the SGPS solution in Figure 1b. Sparseness is measured by the percentage of the parameters selected by SGPS. The straight line is the performance of DGPS using a second degree polynomial kernel. Using 1% of the parameters, SGPS achieves 75% accuracy (1.48% less than the accuracy of DGPS). When 7.8% of the parameters are selected, the accuracy is 76.18%, which is not significantly different from the performance of DGPS (76.48%). We observed that these parameters were related to 6.2% of the observations, along with 1.13 label pairs on average. Thus, during inference one needs to evaluate the kernel function at only 6% of the observations, which reduces the inference time dramatically.

In order to experimentally verify how useful the predictive probabilities are as confidence scores, we forced DGPS to abstain from predicting a label when the probability of a micro-label is lower than a threshold value. In Figure 1c, we plot precision-recall values for different thresholds. We observed that the error rate for DGPS decreased by 8.54% when abstaining on 14.93% of the test data. The improvement in the error rate shows the validity of the probabilities generated by DGPS.

5.2. Named Entity Recognition

Named Entity Recognition (NER), a subtask of Information Extraction, is the task of finding phrases containing names in sentences. The micro-label set consists of the beginning and continuation of person, location, organization and miscellaneous names, as well as non-name. We used a Spanish newswire corpus, which was provided for the Special Session of CoNLL 2002 on NER, from which we randomly selected 1000 sentences (21K words). We used the word and its spelling properties for the current, previous and next observations.

The experimental setup was similar to the pitch accent prediction task. We compared the performance of CRFs with and without the regularizer term (CRF-R, CRF) with the GPSC dense and sparse methods. Qualitatively, the behavior of the different optimization methods is comparable to the pitch accent prediction task. The results are summarized in Table 1. Second degree polynomial DGPS outperformed the other methods. We set the sparseness parameter of SGPS to 25%, i.e. p = 0.25 nlr^2, where r = 9 and nl = 21K on average. SGPS with 25% sparseness achieves an accuracy that is only 0.1% below DGPS. We observed that 19% of the observations are selected, along with 1.32 label pairs on average, which means that one needs to compute only one fifth of the gram matrix.

Table 1. Test error of NER over a window of size 3 using 5-fold cross validation.

          DGPS1   DGPS2   SGPS2   CRF    CRF-B
  Error   4.58    4.39    4.48    4.92   4.56

We also tried a sparse algorithm that does not exploit the kernel structure and optimizes Equation 8 to obtain sparse solutions in terms of observation sequences X and label sequences y, as opposed to SGPS, where the sparse solution is in terms of observations and label pairs. This method achieved 92.7% accuracy and hence was clearly outperformed by all the other methods.

6. Conclusion and Future Work

We presented GPSC, a generalization of Gaussian Process classification to the label sequence learning problem. This method combines the rigorous probabilistic semantics of CRFs with the ability to overcome the curse of dimensionality by using kernels to construct and learn over an RKHS. The experiments on named entity recognition show the competitiveness of our approach, and the experiments on pitch accent prediction show its superiority in terms of the achieved error rate. We also experimentally verified the usefulness of the probabilities obtained from GPSC.

Acknowledgments

This work was supported by NSF-ITR grants IIS-0312401 and IIS-0085940. Thanks to Michelle Gregory for providing us the pitch accent data and the valuable features.

References

Altun, Y., Hofmann, T., & Johnson, M. (2003a). Discriminative learning for label sequences via boosting. Advances in Neural Information Processing Systems.

Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003b). Hidden Markov support vector machines. 20th International Conference on Machine Learning.

Bennett, K., Momma, M., & Embrechts, J. (2002). MARK: A boosting algorithm for heterogeneous kernel models. Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining.

Collins, M. (2002). Discriminative training methods for Hidden Markov Models: Theory and experiments with perceptron algorithms. Empirical Methods in Natural Language Processing (EMNLP).

Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2.

Csató, L., & Opper, M. (2002). Sparse on-line Gaussian Processes. Neural Computation, 14, 641–668.

Gibbs, M. N., & MacKay, D. J. C. (2000). Variational Gaussian Process classifiers. IEEE Transactions on Neural Networks, 11, 1458.

Girosi, F. (1997). An equivalence between sparse approximation and support vector machines (Technical Report AIM-1606).

Greenberg, S., Ellis, D., & Hollenback, J. (1996). Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. ICSLP96.

Jaakkola, T. S., & Jordan, M. I. (1996). Computing upper and lower bounds on likelihoods in intractable networks. Proc. of the Twelfth Conf. on UAI.

Kimeldorf, G., & Wahba, G. (1971). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41(2), 495–502.

Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning.

McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum Entropy Markov Models for information extraction and segmentation. Proc. 17th International Conf. on Machine Learning (ICML 2000).

Minka, T. (2001). A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab.

Opper, M., & Winther, O. (2000). Gaussian Processes for classification: Mean-field algorithms. Neural Computation, 12, 2655–2684.

Punyakanok, V., & Roth, D. (2000). The use of classifiers in sequential inference. Advances in Neural Information Processing Systems.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. MIT Press.

Seeger, M., Lawrence, N. D., & Herbrich, R. (2003). Fast sparse Gaussian Process methods: The informative vector machine. Advances in Neural Information Processing Systems.

Smola, A. J., & Bartlett, P. L. (2000). Sparse greedy Gaussian Process regression. Advances in Neural Information Processing Systems.

Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proc. 17th International Conf. on Machine Learning.

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. Advances in Neural Information Processing Systems.

Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the European Symposium on Artificial Neural Networks.

Williams, C. K. I., & Barber, D. (1998). Bayesian classification with Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1342–1351.

Williams, C. K. I., & Seeger, M. (2000). Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems.

Zhu, J., & Hastie, T. (2001). Kernel logistic regression and the import vector machine. Advances in Neural Information Processing Systems.