
Metric Embedding for Kernel Classification Rules

Bharath K. Sriperumbudur [email protected]
Omer A. Lang [email protected]
Gert R. G. Lanckriet [email protected]
Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093, USA.

Abstract

In this paper, we consider a smoothing kernel based classification rule and propose an algorithm for optimizing the performance of the rule by learning the bandwidth of the smoothing kernel along with a data-dependent metric. The data-dependent distance metric is obtained by learning a function that embeds an arbitrary metric space into a Euclidean space while minimizing an upper bound on the resubstitution estimate of the error probability of the kernel classification rule. By restricting this embedding function to a reproducing kernel Hilbert space, we reduce the problem to solving a semidefinite program and show the resulting kernel classification rule to be a variation of the k-nearest neighbor rule. We compare the performance of the kernel rule (using the learned data-dependent distance metric) to state-of-the-art distance metric learning algorithms (designed for k-nearest neighbor classification) on benchmark datasets. The results show that the proposed rule achieves classification accuracy that is as good as or better than that of the other metric learning algorithms.

1. Introduction

Parzen window methods, also called smoothing kernel rules, are widely used in nonparametric density and function estimation, where they are popularly known as kernel density and kernel regression estimates, respectively. In this paper, we consider these rules for classification. To this end, consider the binary classification problem of classifying $x \in \mathbb{R}^D$, given an i.i.d. training sample $\{(X_i, Y_i)\}_{i=1}^n$ drawn from some unknown distribution $\mathcal{D}$, where $X_i \in \mathbb{R}^D$ and $Y_i \in \{0, 1\}$, $\forall\, i \in [n] := \{1, \ldots, n\}$. The kernel classification rule (Devroye et al., 1996, Chapter 10) is given by

$$ g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n 1_{\{Y_i = 0\}} K\!\left(\frac{x - X_i}{h}\right) \ge \sum_{i=1}^n 1_{\{Y_i = 1\}} K\!\left(\frac{x - X_i}{h}\right), \\ 1 & \text{otherwise}, \end{cases} \qquad (1) $$

where $K : \mathbb{R}^D \to \mathbb{R}$ is a kernel function, which is usually nonnegative and monotone decreasing along rays starting from the origin. The number $h > 0$ is called the smoothing factor, or bandwidth, of the kernel function, and provides some form of distance weighting. We warn the reader not to confuse the kernel function, $K$, with the reproducing kernel (Schölkopf & Smola, 2002) of a reproducing kernel Hilbert space (RKHS), which we denote by $\mathcal{K}$.¹ When $K(x) = 1_{\{\|x\|_2 \le 1\}}(x)$ (sometimes called the naïve kernel), the rule is similar to the k-nearest neighbor (k-NN) rule, except that $k$ is different for each $X_i$ in the training set. The k-NN rule classifies each unlabeled example by the majority label among its $k$ nearest neighbors in the training set, whereas the kernel rule with the naïve kernel classifies each unlabeled example by the majority label among its neighbors that lie within a radius of $h$. Devroye and Krzyżak (1989) proved that for regular kernels (see Devroye et al., 1996, Definition 10.1), if the smoothing parameter $h \to 0$ such that $nh^D \to \infty$ as $n \to \infty$, then the kernel classification rule is universally consistent. But, for a particular $n$, asymptotic results provide little guidance in the selection of $h$, and selecting the wrong value of $h$ may lead to very poor error rates. In fact, the crux of every nonparametric estimation problem is the choice of an appropriate smoothing factor. This is one of the questions that we address in this paper, by proposing an algorithm to learn an optimal $h$.

¹Unlike $\mathcal{K}$, $K$ is not required to be a positive definite function. If $K$ is a positive definite function, then it corresponds to a translation-invariant kernel of some RKHS defined on $\mathbb{R}^D$. In such a case, the classification rule in Eq. (1) is similar to the ones that appear in the kernel machines literature.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).
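To make Eq. (1) concrete, the following is a minimal NumPy sketch (ours, not from the paper; function names are assumptions) of the binary rule, with the naïve and Gaussian smoothing kernels as examples:

```python
import numpy as np

def naive_kernel(u):
    # K(u) = 1 if ||u||_2 <= 1, else 0
    return float(np.linalg.norm(u) <= 1.0)

def gaussian_kernel(u):
    # K(u) = exp(-||u||_2^2)
    return float(np.exp(-np.linalg.norm(u) ** 2))

def kernel_rule_binary(x, X, Y, h, K=naive_kernel):
    """Binary kernel classification rule of Eq. (1).

    X: (n, D) training inputs, Y: (n,) integer labels in {0, 1}, h: bandwidth.
    Returns 0 if the class-0 vote is at least the class-1 vote, else 1.
    """
    votes = np.zeros(2)
    for Xi, Yi in zip(X, Y):
        votes[Yi] += K((x - Xi) / h)
    return 0 if votes[0] >= votes[1] else 1
```

With the naïve kernel, the two sums are simply the counts of class-0 and class-1 training points within distance $h$ of $x$, which is the k-NN-like behavior described above.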

The second question that we address is learning an optimal distance metric. For $x \in \mathbb{R}^D$, $K$ is usually a function of $\|x\|_2$. Some popular kernels include the Gaussian kernel, $K(x) = e^{-\|x\|_2^2}$; the Cauchy kernel, $K(x) = 1/(1 + \|x\|_2^{D+1})$; and the Epanechnikov kernel, $K(x) = (1 - \|x\|_2^2)\,1_{\{\|x\|_2 \le 1\}}$.² Snapp and Venkatesh (1998) have shown that the finite-sample risk of the k-NN rule may be reduced, for large values of $n$, by using a weighted Euclidean metric, even though the infinite-sample risk is independent of the metric used. This has been experimentally confirmed by Xing et al. (2003), Shalev-Shwartz et al. (2004), Goldberger et al. (2005), Globerson and Roweis (2006), and Weinberger et al. (2006). They all assume the metric to be $\rho(x, y) = \sqrt{(x - y)^T \Sigma (x - y)} = \|L(x - y)\|_2$ for $x, y \in \mathbb{R}^D$, where $\Sigma = L^T L$ is the weighting matrix, and optimize over $\Sigma$ to improve the performance of the k-NN rule. Since the kernel rule is similar to the k-NN rule, one would expect that its performance can be improved by making $K$ a function of $\|Lx\|_2$. Another way to interpret this is to find a linear transformation $L \in \mathbb{R}^{d \times D}$ so that the transformed data lie in a Euclidean metric space.

²The Gaussian kernel is a positive definite function on $\mathbb{R}^D$, while the Epanechnikov and naïve kernels are not.

Some applications call for natural distance measures that reflect the underlying structure of the data at hand. For example, when computing the distance between two images, tangent distance would be more appropriate than the Euclidean distance. Similarly, when computing the distance between points that lie on a low-dimensional manifold in $\mathbb{R}^D$, geodesic distance is a more natural distance measure than the Euclidean distance. Most of the time, since the true distance metric is either unknown or difficult to compute, the Euclidean or a weighted Euclidean distance is used as a surrogate. In the absence of prior knowledge, the data may be used to select a suitable metric, which can lead to better classification performance. In addition, instead of $x \in \mathbb{R}^D$, suppose $x \in (\mathcal{X}, \rho)$, where $\mathcal{X}$ is a metric space with $\rho$ as its metric. One would like to extend the kernel classification rule to such $\mathcal{X}$. In this paper, we address these issues by learning a transformation that embeds the data from $\mathcal{X}$ into a Euclidean metric space while improving the performance of the kernel classification rule.

The rest of the paper is organized as follows. In §2, we formulate the multi-class kernel classification rule and propose learning a transformation, $\varphi$ (which embeds the training data into a Euclidean space), and the bandwidth parameter, $h$, by minimizing an upper bound on the resubstitution estimate of the error probability. To achieve this, in §3, we restrict $\varphi$ to an RKHS and derive a representation for it by invoking the generalized representer theorem. Since the resulting optimization problem is non-convex, in §4 we approximate it with a semidefinite program when $K$ is the naïve kernel. We present experimental results in §5, wherein we show on benchmark datasets that the proposed algorithm performs better than k-NN and state-of-the-art metric learning algorithms developed for the k-NN rule.

2. Problem Formulation

Let $\{(X_i, Y_i)\}_{i=1}^n$ denote an i.i.d. training set drawn from some unknown distribution $\mathcal{D}$, where $X_i \in (\mathcal{X}, \rho)$ and $Y_i \in [L]$, with $L$ being the number of classes. The multi-class kernel classification rule is given by

$$ g_n(x) = \arg\max_{l \in [L]} \sum_{i=1}^n 1_{\{Y_i = l\}} K_{X_i, h}(x), \qquad (2) $$

where $K : \mathcal{X} \to \mathbb{R}_+$ and $K_{x_0, h}(x) = \chi\!\left(\frac{\rho(x, x_0)}{h}\right)$ for some nonnegative function, $\chi$, with $\chi(0) = 1$. The probability of error associated with the above rule is $L(g_n) := \Pr_{(X, Y) \sim \mathcal{D}}(g_n(X) \ne Y)$, where $Y$ is the true label associated with $X$. Since $\mathcal{D}$ is unknown, $L(g_n)$ cannot be computed directly but can only be estimated from the training set. The resubstitution estimate, $\widehat{L}(g_n)$, which counts the number of errors committed on the training set by the classification rule, is given by $\widehat{L}(g_n) := \frac{1}{n} \sum_{i=1}^n 1_{\{g_n(X_i) \ne Y_i\}}$.³

³Apart from the resubstitution estimate, holdout and deleted estimates can also be used to estimate the error probability. These estimates are usually more reliable but more involved than the resubstitution estimate.

As mentioned in §1, when $\mathcal{X} = \mathbb{R}^D$, $\rho$ is usually chosen to be $\|\cdot\|_2$. Previous works in distance metric learning learn a linear transformation $L : \mathbb{R}^D \to \mathbb{R}^d$, leading to the distance metric $\rho_L(X_i, X_j) := \|L X_i - L X_j\|_2 = \sqrt{(X_i - X_j)^T \Sigma (X_i - X_j)}$, where $\Sigma$ captures the variance-covariance structure of the data. In this work, our goal is to jointly learn $h$ and a measurable function, $\varphi \in \mathcal{C} := \{\varphi : \mathcal{X} \to \mathbb{R}^d\}$, so that the resubstitution estimate of the error probability is minimized with $\rho_\varphi(X_i, X_j) := \|\varphi(X_i) - \varphi(X_j)\|_2$. Once $h$ and $\varphi$ are known, the kernel classification rule is completely specified by Eq. (2).
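As an illustration (ours, not the authors'), the multi-class rule of Eq. (2) and the resubstitution estimate can be sketched as follows, assuming a user-supplied metric `rho` and a profile function `chi` with `chi(0) = 1`:

```python
import numpy as np

def multiclass_kernel_rule(x, X, Y, h, rho, chi):
    """Multi-class kernel rule of Eq. (2).

    X: list/array of training inputs; Y: integer labels in {0, ..., L-1};
    rho(a, b): metric on the input space; chi: nonnegative profile, chi(0) = 1.
    """
    L = int(np.max(Y)) + 1
    votes = np.zeros(L)
    for Xi, Yi in zip(X, Y):
        votes[Yi] += chi(rho(x, Xi) / h)
    return int(np.argmax(votes))

def resubstitution_error(X, Y, h, rho, chi):
    # \hat{L}(g_n) = (1/n) * sum_i 1{g_n(X_i) != Y_i}
    preds = [multiclass_kernel_rule(Xi, X, Y, h, rho, chi) for Xi in X]
    return float(np.mean(np.array(preds) != np.array(Y)))
```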
Devroye et al. (1996, Section 25.6) show that kernel rules of the form in Eq. (1), picked by minimizing $\widehat{L}(g_n)$ over the smoothing factor $h > 0$, are generally inconsistent if $X$ is nonatomic. The same argument can be extended to the multi-class rule given by Eq. (2). To learn $\varphi$, simply minimizing $\widehat{L}(g_n)$ without any smoothness conditions on $\varphi$ can lead to kernel rules that overfit the training set. Such a $\varphi$ can be constructed as follows. Let $n_l$ be the number of points that belong to the $l$-th class, and suppose $n_1 = n_2 = \cdots = n_L$. Then, for any $h \ge 1$, choosing $\varphi(X) = Y_i$ when $X = X_i$ and $\varphi(X) = 0$ when $X \notin \{X_i\}_{i=1}^n$ clearly yields zero resubstitution error. However, such a choice of $\varphi$ leads to a kernel rule that always assigns unseen data to class 1, resulting in very poor performance. Therefore, to avoid overfitting to the training set, the function class $\mathcal{C}$ should satisfy some smoothness properties so that highly non-smooth functions like the one defined above are not chosen while minimizing $\widehat{L}(g_n)$. To this end, we introduce a penalty functional, $\Omega : \mathcal{C} \to \mathbb{R}_+$, which penalizes non-smooth functions in $\mathcal{C}$ so that they are not selected.⁴ Our goal is therefore to learn $\varphi$ and $h$ by minimizing the regularized error functional

$$ L_{\mathrm{reg}}(\varphi, h) = \frac{1}{n} \sum_{i=1}^n 1_{\{g_n(X_i) \ne Y_i\}} + \lambda' \, \Omega[\varphi], \qquad (3) $$

where $\varphi \in \mathcal{C}$, $h > 0$ and the regularization parameter $\lambda' > 0$; here $g_n$ in Eq. (3) is given by Eq. (2), with $\rho$ replaced by $\rho_\varphi$. Minimizing $L_{\mathrm{reg}}(\varphi, h)$ is equivalent to minimizing the number of training instances for which $g_n(X) \ne Y$ over the function class $\{\varphi : \Omega[\varphi] \le s\}$, for some appropriately chosen $s$.

⁴This is equivalent to restricting the size of the function class $\mathcal{C}$ from which $\varphi$ has to be selected.

Consider $g_n(x)$ defined in Eq. (2), and suppose $Y_i = k$ for some $X_i$. Then $g_n(X_i) = k$ if and only if

$$ \sum_{\{j : Y_j = k\}} K^{\varphi}_{X_j, h}(X_i) \;\ge\; \max_{l \in [L],\, l \ne k} \sum_{\{j : Y_j = l\}} K^{\varphi}_{X_j, h}(X_i), \qquad (4) $$

where the superscript $\varphi$ indicates the dependence of $K$ on $\varphi$.⁵ Since the right hand side of Eq. (4) involves the max function, which is not differentiable, we use the inequality $\max\{a_1, \ldots, a_m\} \le \sum_{i=1}^m a_i$ to upper bound it by $\sum_{l \in [L],\, l \ne k} \sum_{j=1}^n 1_{\{Y_j = l\}} K_{X_j}(X_i)$.⁶ Thus, to maximize $\sum_{i=1}^n 1_{\{g_n(X_i) = Y_i\}}$, we maximize its lower bound

$$ \sum_{i=1}^n 1_{\left\{ \sum_{j \ne i} 1_{\{Y_j = Y_i\}} K_{X_j}(X_i) \;\ge\; \sum_{j=1}^n 1_{\{Y_j \ne Y_i\}} K_{X_j}(X_i) \right\}}, $$

resulting in a conservative rule.⁷ In the above bound, we use $j \ne i$ just to make sure that $\varphi(X_i)$ is not the only point within its neighborhood of radius $h$.

⁵To simplify the notation, from now on we write $K^{\varphi}_{X_j, h}(X_i)$ as $K_{X_j}(X_i)$, where the dependence of $K$ on $\varphi$ and $h$ is implicit.

⁶Another upper bound that can be used for the max function is $\max\{a_1, \ldots, a_m\} \le \log\!\left(\sum_{i=1}^m e^{a_i}\right)$.

⁷Using the upper bound of the max function in Eq. (4) makes the resulting kernel rule conservative, as there can be samples from the training set that do not satisfy this inequality but are nevertheless correctly classified according to Eq. (2).
Define $\tau_{ij} := 2\delta_{Y_i, Y_j} - 1$, where $\delta$ represents the Kronecker delta. Then the problem of learning $\varphi$ and $h$ by minimizing $L_{\mathrm{reg}}(\varphi, h)$ in Eq. (3) reduces to solving the following optimization problem,

$$ \min_{\varphi,\, h} \left\{ \sum_{i=1}^n \psi_i(\varphi, h) + \lambda\, \Omega[\varphi] \;:\; \varphi \in \mathcal{C},\ h > 0 \right\}, \qquad (5) $$

where $\lambda = n\lambda'$ and

$$ \psi_i(\varphi, h) = 1_{\left\{ \sum_{j \ne i} 1_{\{\tau_{ij} = 1\}} K_{X_j}(X_i) \;<\; \sum_{j=1}^n 1_{\{\tau_{ij} = -1\}} K_{X_j}(X_i) \right\}} $$

is an upper bound on $1_{\{g_n(X_i) \ne Y_i\}}$ for $i \in [n]$. Solving the above non-convex optimization problem is NP-hard, and gradient-based optimization is difficult because the gradients are zero almost everywhere. In addition to this computational hardness, the problem in Eq. (5) is not theoretically solvable unless some assumptions about $\mathcal{C}$ are made. In the following section, we assume $\mathcal{C}$ to be an RKHS with reproducing kernel $\mathcal{K}$ and provide a representation for the optimal $\varphi$ that minimizes Eq. (5). We remind the reader that $K$ is a smoothing kernel, which is not required to be a positive definite function but takes nonnegative values, while $\mathcal{K}$ is a reproducing kernel, which is positive definite and can take negative values.

3. Regularization in Reproducing Kernel Hilbert Space

Many machine learning algorithms, such as SVMs, regularization networks, and logistic regression, can be derived within the framework of regularization in an RKHS by choosing an appropriate empirical risk functional with the penalizer being the squared RKHS norm (see Evgeniou et al., 2000). In Eq. (5), we have extended this framework to kernel classification rules, wherein we compute the $\varphi \in \mathcal{C}$ and $h > 0$ that minimize an upper bound on the resubstitution estimate of the error probability. To this end, we choose $\mathcal{C}$ to be an RKHS with the penalty functional being the squared RKHS norm, i.e., $\Omega[\varphi] = \|\varphi\|^2_{\mathcal{C}}$.⁸ For fixed $h$, the objective function $\sum_{i=1}^n \psi_i(\varphi, h)$ in Eq. (5) depends on $\varphi$ only through $\{\|\varphi(X_i) - \varphi(X_j)\|_2\}_{i,j=1}^n$. By letting $\sum_{i=1}^n \psi_i(\varphi, h) = \theta_h\!\left(\{\|\varphi(X_i) - \varphi(X_j)\|_2\}_{i,j=1}^n\right)$, where $\theta_h : \mathbb{R}^{n^2} \to \mathbb{R}_+$, Eq. (5) can be written as

$$ \min_{h > 0}\ \min_{\varphi \in \mathcal{C}}\ \theta_h\!\left(\{\|\varphi(X_i) - \varphi(X_j)\|_2\}_{i,j=1}^n\right) + \lambda\, \|\varphi\|^2_{\mathcal{C}}. \qquad (6) $$

⁸Another choice for $\mathcal{C}$ could be the space of bounded Lipschitz functions with the penalty functional $\Omega[\varphi] = \|\varphi\|_L$, where $\|\varphi\|_L$ is the Lipschitz semi-norm of $\varphi$. With this choice of $\mathcal{C}$ and $\Omega$, von Luxburg and Bousquet (2004) studied large margin classification in metric spaces. One more interesting choice for $\mathcal{C}$ could be the space of Mercer kernel maps. It can be shown that solving for $\varphi$ in Eq. (5) with such a choice for $\mathcal{C}$ is equivalent to learning the kernel matrix associated with $\varphi$ and $\{X_i\}_{i=1}^n$. However, this approach is not useful, as it does not allow for an out-of-sample extension.

The following result provides a representation for the minimizer of Eq. (6); it is proved in Appendix A. We remind the reader that $\varphi$ is a vector-valued mapping from $\mathcal{X}$ to $\mathbb{R}^d$.

Theorem 1 (Multi-output regularization). Suppose $\mathcal{C} = \{\varphi : \mathcal{X} \to \mathbb{R}^d\} = \mathcal{H}_1 \times \cdots \times \mathcal{H}_d$, where $\mathcal{H}_i$ is an RKHS with reproducing kernel $\mathcal{K}_i : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and $\varphi = (\varphi_1, \ldots, \varphi_d)$ with $\mathcal{H}_i \ni \varphi_i : \mathcal{X} \to \mathbb{R}$. Then each minimizer $\varphi \in \mathcal{C}$ of Eq. (6) admits a representation of the form

$$ \varphi_j = \sum_{i=1}^n c_{ij}\, \mathcal{K}_j(\cdot, X_i), \quad \forall\, j \in [d], \qquad (7) $$

where $c_{ij} \in \mathbb{R}$, $\forall\, i \in [n]$, $\forall\, j \in [d]$, and $\sum_{i=1}^n c_{ij} = 0$, $\forall\, j \in [d]$.
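The following sketch (ours) illustrates the parametrization of Theorem 1 for the common case treated below in which all coordinates share one reproducing kernel ($\mathcal{K}_1 = \ldots = \mathcal{K}_d$): $\varphi$ is represented by an $n \times d$ coefficient matrix $C$, and all distances are computed through kernel evaluations only.

```python
import numpy as np

def embed(x, X, C, kernel):
    """phi(x) per Eq. (7) with a single shared reproducing kernel:
    phi_j(x) = sum_i C[i, j] * kernel(x, X[i]).
    Theorem 1 additionally requires each column of C to sum to zero."""
    k_x = np.array([kernel(x, Xi) for Xi in X])   # k^x, shape (n,)
    return C.T @ k_x                               # shape (d,)

def rho_phi(x, y, X, C, kernel):
    # rho_phi(x, y) = ||phi(x) - phi(y)||_2
    return float(np.linalg.norm(embed(x, X, C, kernel) - embed(y, X, C, kernel)))
```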

Remark 2. (a) By Eq. (7), $\varphi$ is completely determined by $\{c_{ij} : i \in [n],\, j \in [d]\}$. Therefore, the problem of learning $\varphi$ reduces to learning $n \cdot d$ scalars, $\{c_{ij} : \sum_{i=1}^n c_{ij} = 0\}$.

(b) $\theta_h$ in Eq. (6) depends on $\varphi$ only through $\|\varphi(\cdot) - \varphi(\cdot)\|_2$. Therefore, for any $z, w \in \mathcal{X}$, we have $\|\varphi(z) - \varphi(w)\|_2^2 = \sum_{m=1}^d \left[ c_m^T (k^z_m - k^w_m) \right]^2 = \sum_{m=1}^d \mathrm{tr}\!\left( \Sigma_m (k^z_m - k^w_m)(k^z_m - k^w_m)^T \right)$, where $c_m := (c_{1m}, \ldots, c_{nm})^T$, $k^z_m := (\mathcal{K}_m(z, X_1), \ldots, \mathcal{K}_m(z, X_n))^T$, $\Sigma_m := c_m c_m^T$, $\forall\, m \in [d]$, and $\mathrm{tr}(\cdot)$ represents the trace.

(c) The regularizer $\|\varphi\|^2_{\mathcal{C}}$ in Eq. (6) is given by $\|\varphi\|^2_{\mathcal{C}} = \sum_{m=1}^d \|\varphi_m\|^2_{\mathcal{H}_m} = \sum_{m=1}^d \sum_{i,j=1}^n c_{im} c_{jm} \mathcal{K}_m(X_i, X_j) = \sum_{m=1}^d c_m^T \mathbf{K}_m c_m = \sum_{m=1}^d \mathrm{tr}(\mathbf{K}_m \Sigma_m)$, where $\mathbf{K}_m := (k^{X_1}_m, \ldots, k^{X_n}_m)$.

(d) Since $\varphi$ appears in Eq. (6) only through $\rho_\varphi$ and $\|\varphi\|^2_{\mathcal{C}}$, learning $\varphi$ is equivalent to learning $\{\Sigma_m \succeq 0 : \mathrm{rank}(\Sigma_m) = 1,\ \mathbf{1}^T \Sigma_m \mathbf{1} = 0\}_{m=1}^d$.

In the above remark, we have shown that $\theta_h$ and $\|\varphi\|_{\mathcal{C}}$ in Eq. (6) depend only on the entries of $d$ kernel matrices (associated with the $d$ kernel functions) and on the $n \cdot d$ scalars $\{c_{ij}\}$. In addition, we also reduced the representation of $\varphi$ from $\{c_m\}_{m=1}^d$ to $\{\Sigma_m\}_{m=1}^d$. It can be seen that $\rho_\varphi^2$ and $\|\varphi\|^2_{\mathcal{C}}$ are convex quadratic functions of $\{c_m\}_{m=1}^d$, while they are linear functions of $\{\Sigma_m\}_{m=1}^d$. Depending on the nature of $K$, one representation can be more useful than the other.

Corollary 3. Suppose $\mathcal{K}_1 = \ldots = \mathcal{K}_d = \mathcal{K}$. Then, for any $z, w \in \mathcal{X}$, $\rho_\varphi^2(z, w)$ is the Mahalanobis distance between $k^z$ and $k^w$, with $\Sigma := \sum_{m=1}^d \Sigma_m$ as its metric.

Proof. By Remark 2, we have $\rho_\varphi^2(z, w) = \|\varphi(z) - \varphi(w)\|_2^2 = \sum_{m=1}^d (k^z_m - k^w_m)^T \Sigma_m (k^z_m - k^w_m)$. Since $\mathcal{K}_1 = \ldots = \mathcal{K}_d = \mathcal{K}$, we have $k^z_1 = \ldots = k^z_d = k^z$. Therefore, $\rho_\varphi^2(z, w) = (k^z - k^w)^T \Sigma (k^z - k^w)$, where $\Sigma := \sum_{m=1}^d \Sigma_m$.

The above result reduces the problem of learning $\varphi$ to learning a matrix $\Sigma \succeq 0$ such that $\mathrm{rank}(\Sigma) \le d$ and $\mathbf{1}^T \Sigma \mathbf{1} = 0$. We now study the above result for linear kernels. The following corollary shows that using a linear kernel is equivalent to assuming the underlying distance metric in $\mathcal{X}$ to be the Mahalanobis distance.

Corollary 4 (Linear kernel). Let $\mathcal{X} = \mathbb{R}^D$ and $z, w \in \mathcal{X}$. If $\mathcal{K}(z, w) = \langle z, w \rangle_2 = z^T w$, then $\varphi(z) = Lz \in \mathbb{R}^d$ and $\|\varphi(z) - \varphi(w)\|_2^2 = (z - w)^T A (z - w)$ with $A := L^T L$.

Proof. By Remark 2 and Corollary 3, we have $\varphi_m(z) = \sum_{i=1}^n c_{im} \mathcal{K}(z, X_i) = \left( \sum_{i=1}^n c_{im} X_i \right)^T z =: \ell_m^T z$. Therefore, $\varphi(z) = Lz$, where $L := (\ell_1, \ldots, \ell_d)^T$. In addition, $\|\varphi(z) - \varphi(w)\|_2^2 = (z - w)^T A (z - w)$ with $A := L^T L$.

In the following section, we use these results to derive an algorithm that jointly learns $\varphi$ and $h$ by solving Eq. (5).
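A small sketch (ours) of Corollary 3: with a single shared reproducing kernel, the embedded distance is a Mahalanobis distance between empirical kernel maps and can be computed directly from $\Sigma$, without forming $\varphi$ explicitly.

```python
import numpy as np

def kernel_vector(z, X, kernel):
    # k^z = (K(z, X_1), ..., K(z, X_n))^T
    return np.array([kernel(z, Xi) for Xi in X])

def rho_phi_sq(z, w, X, Sigma, kernel):
    """rho_phi^2(z, w) = (k^z - k^w)^T Sigma (k^z - k^w)   (Corollary 3)."""
    d = kernel_vector(z, X, kernel) - kernel_vector(w, X, kernel)
    return float(d @ Sigma @ d)

# Corollary 4: with the linear kernel K(z, w) = z^T w, k^z = X z (X stacked as an
# (n, D) matrix), so rho_phi^2(z, w) = (z - w)^T (X^T Sigma X) (z - w), i.e. a
# Mahalanobis distance in the input space with weighting matrix A = X^T Sigma X.
```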
4. Convex Relaxations & Semidefinite Program

Having addressed the theoretical issue of making assumptions about $\mathcal{C}$ to solve Eq. (5), we return to the computational issue pointed out in §2. The program in Eq. (5) is NP-hard because of the nature of $\{\psi_i\}_{i=1}^n$. This issue can be alleviated by minimizing a convex upper bound of $\psi_i$ instead of $\psi_i$ itself. Some convex upper bounds for the function $\psi(x) = 1_{\{x > 0\}}$ are $\Psi(x) = \max(0, 1 + x) := [1 + x]_+$, $\Psi(x) = \log_2(1 + e^x)$, etc. Replacing $\psi_i$ by $\Psi_i$ in Eq. (5) results in the following program,

$$ \min_{\varphi \in \mathcal{C},\, h > 0}\ \sum_{i=1}^n \Psi_i\!\left( \gamma_i^-(\varphi, h) - \gamma_i^+(\varphi, h) \right) + \lambda\, \|\varphi\|^2_{\mathcal{C}}, \qquad (8) $$

where $\gamma_i^+(\varphi, h) := \sum_{\{j \ne i :\, \tau_{ij} = 1\}} K_{X_j}(X_i)$ and $\gamma_i^-(\varphi, h) := \sum_{\{j :\, \tau_{ij} = -1\}} K_{X_j}(X_i)$. Eq. (8) can still be computationally hard to solve, depending on the choice of the smoothing kernel $K$. Even if we choose $K$ such that $\gamma^+$ and $\gamma^-$ are jointly convex in $\varphi$ and $h$ for some representation of $\varphi$ (see Remark 2), Eq. (8) is still non-convex, as the argument of $\Psi_i$ is a difference of two convex functions.⁹ In addition, if $\Psi(x) = [1 + x]_+$, then Eq. (8) is a d.c. (difference of convex functions) program (Horst & Thoai, 1999), which is NP-hard to solve. So, even in the nicest of cases, one has to resort to local optimization methods or computationally intensive global optimization methods. Nevertheless, if one does not worry about this disadvantage, then solving Eq. (8) yields $\varphi$ (in terms of $\{c_m\}_{m=1}^d$ or $\{\Sigma_m\}_{m=1}^d$, depending on the chosen representation) and $h$ that can be used in Eq. (2) to classify unseen data. In the following, however, we show that Eq. (8) can be turned into a convex program for the naïve kernel. As mentioned in §1, this choice of kernel leads to a classification rule that is similar in principle to the k-NN rule.

⁹For example, let $K$ be a Gaussian kernel, $K_y(x) = \exp(-\rho_\varphi^2(x, y)/h)$. Using the $\{\Sigma_m\}_{m=1}^d$ representation for $\varphi$, $\rho_\varphi^2(x, y)$ is linear in $\{\Sigma_m\}_{m=1}^d$ and therefore $K_y(x)$ is convex in $\{\Sigma_m\}_{m=1}^d$ (here we assume $h$ to be fixed). Thus $\gamma_i^+$ and $\gamma_i^-$ are convex in $\varphi$.

4.1. Naïve kernel: Semidefinite relaxation

The naïve kernel, $K_{x_0}(x) = 1_{\{\rho_\varphi(x, x_0) \le h\}}$, gives the points $\varphi(x)$ that lie within a ball of radius $h$ centered at $\varphi(x_0)$ a weight of 1, while the remaining points have zero weight. Using this in Eq. (8), we have $\gamma_i^-(\varphi, h) - \gamma_i^+(\varphi, h) = \sum_{\{j : \tau_{ij} = -1\}} 1_{\{\rho_\varphi(X_i, X_j) \le h\}} - \sum_{\{j : \tau_{ij} = 1\}} 1_{\{\rho_\varphi(X_i, X_j) \le h\}} + 1$, which is the difference between the number of points with a label different from $Y_i$ that lie within the ball of radius $h$ centered at $\varphi(X_i)$ and the number of points with the same label as $X_i$ (excluding $X_i$ itself) that lie within the same ball. If this difference is positive, then the classification rule in Eq. (2) makes an error in classifying $X_i$. Therefore, $\varphi$ and $h$ should be chosen such that this misclassification rate is minimized. Suppose that $\{\varphi(X_i)\}_{i=1}^n$ is given. Then $h$ determines the misclassification rate, like $k$ does in k-NN. It can be seen that the kernel classification rule and the k-NN rule are similar when $K$ is the naïve kernel: with k-NN, the number of nearest neighbors is fixed for every point, whereas with the kernel rule it varies from point to point; conversely, the radius of the ball containing the nearest neighbors varies from point to point in the k-NN setting, while it is the same for every point in the kernel rule.
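For the naïve kernel, $\gamma_i^- - \gamma_i^+$ is just a difference of neighbor counts; a minimal sketch (ours), assuming a precomputed matrix of $\rho_\varphi$ distances:

```python
import numpy as np

def gamma_counts(i, dist, Y, h):
    """For training point i: the number of differently labeled points within
    radius h (gamma_i^-) and the number of same-labeled points within radius h,
    excluding i itself (gamma_i^+). dist is a precomputed (n, n) matrix of
    rho_phi distances, Y the label vector."""
    within = dist[i] <= h
    gamma_minus = int(np.sum((Y != Y[i]) & within))
    gamma_plus = int(np.sum((Y == Y[i]) & within)) - 1   # exclude the point itself
    return gamma_plus, gamma_minus

# The conservative criterion of this subsection flags X_i as misclassified when
# gamma_minus - gamma_plus > 0 (boundary ties are neglected, as in the text).
```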
$\gamma_i^-(\varphi, h) - \gamma_i^+(\varphi, h)$ can be reduced to a more amenable form by the following algebra. Using $\sum_{\{j : \tau_{ij} = 1\}} 1_{\{\rho_\varphi(X_i, X_j) \le h\}} = \sum_{j=1}^n 1_{\{\tau_{ij} = 1\}} - \sum_{\{j : \tau_{ij} = 1\}} 1_{\{\rho_\varphi(X_i, X_j) > h\}}$, we get $\gamma_i^-(\varphi, h) - \gamma_i^+(\varphi, h) = 1 - n_i^+ + \sum_{j=1}^n 1_{\{\tau_{ij}\, \rho_\varphi^2(X_i, X_j) > \tau_{ij}\, \tilde{h}\}}$, where $n_i^+ := \sum_{j=1}^n 1_{\{\tau_{ij} = 1\}}$ and $\tilde{h} := h^2$. Note that, for simplicity, we have neglected the set $\{j : \tau_{ij} = -1,\ \rho_\varphi(X_i, X_j) = h\}$ in the above calculation. Using $\Psi(x) = [1 + x]_+$, the first half of the objective function in Eq. (8) reduces to $\sum_{i=1}^n \left[ 2 - n_i^+ + \sum_{j=1}^n 1_{\{\tau_{ij}\, \rho_\varphi^2(X_i, X_j) > \tau_{ij}\, \tilde{h}\}} \right]_+$. Applying the convex relaxation once more, now to the step function, we get $\sum_{i=1}^n \left[ 2 - n_i^+ + \sum_{j=1}^n \left[ 1 + \tau_{ij}\, \rho_\varphi^2(X_i, X_j) - \tau_{ij}\, \tilde{h} \right]_+ \right]_+$ as an upper bound on the first half of the objective function in Eq. (8). Since $\rho_\varphi^2$ is a quadratic function of $\{c_m\}_{m=1}^d$, representing $\varphi$ in terms of $\{c_m\}_{m=1}^d$ results in a d.c. program, whereas its representation in terms of $\{\Sigma_m\}_{m=1}^d$ results in a semidefinite program (SDP) (except for the rank constraints), since $\rho_\varphi^2$ is linear in $\{\Sigma_m\}_{m=1}^d$. Assuming for simplicity that $\mathcal{K}_1 = \ldots = \mathcal{K}_d = \mathcal{K}$ and neglecting the constraint $\mathrm{rank}(\Sigma) \le d$, we obtain the following SDP,

$$ \begin{aligned} \min_{\Sigma,\, \tilde{h}} \quad & \sum_{i=1}^n \left[ 2 - n_i^+ + \sum_{j=1}^n \left[ 1 + \tau_{ij}\, \mathrm{tr}(M_{ij} \Sigma) - \tau_{ij}\, \tilde{h} \right]_+ \right]_+ + \lambda\, \mathrm{tr}(\mathbf{K} \Sigma) \\ \text{s.t.} \quad & \Sigma \succeq 0, \quad \mathbf{1}^T \Sigma \mathbf{1} = 0, \quad \tilde{h} > 0, \end{aligned} \qquad (9) $$

where $M_{ij} := (k^{X_i} - k^{X_j})(k^{X_i} - k^{X_j})^T$. For notational details, refer to Remark 2 and Corollary 3. Since one does not usually know the optimal embedding dimension, $d$, the $\Sigma$ representation is advantageous in that it is independent of $d$ (as we neglected the rank constraint) and depends only on $n$. On the other hand, this is also a disadvantage, as the algorithm does not scale well to large datasets.

Although the program in Eq. (9) is convex, solving it with general purpose solvers that use interior point methods scales as $O(n^6)$, which is prohibitive. Instead, following the ideas of Weinberger et al. (2006), we used a first order gradient method (which scales as $O(n^2)$ per iteration) combined with an alternating projections method (which scales as $O(n^3)$ per iteration). At each iteration, we take a small step in the direction of the negative gradient of the objective function, followed by a projection onto the set $\mathcal{N} = \{\Sigma : \Sigma \succeq 0,\ \mathbf{1}^T \Sigma \mathbf{1} = 0\}$ and onto $\{\tilde{h} > 0\}$. The projection onto $\mathcal{N}$ is performed by an alternating projections method, which projects a symmetric matrix alternately between the convex sets $\mathcal{A} = \{\Sigma : \Sigma \succeq 0\}$ and $\mathcal{B} = \{\Sigma : \mathbf{1}^T \Sigma \mathbf{1} = 0\}$. Since $\mathcal{A} \cap \mathcal{B} \ne \emptyset$, this alternating projections method is guaranteed to find a point in $\mathcal{A} \cap \mathcal{B}$. Given any $A_0 \in \mathcal{A}$, the alternating projections algorithm computes $B_m = P_{\mathcal{B}}(A_m) \in \mathcal{B}$ and $A_{m+1} = P_{\mathcal{A}}(B_m) \in \mathcal{A}$, $m = 0, 1, 2, \ldots$, where $P_{\mathcal{A}}$ and $P_{\mathcal{B}}$ are the projections onto $\mathcal{A}$ and $\mathcal{B}$, respectively. In summary, the update rule is $B_m = A_m - \frac{\mathbf{1}^T A_m \mathbf{1}}{n^2}\, \mathbf{1}\mathbf{1}^T$ and $A_{m+1} = \sum_{i=1}^n [\lambda_i]_+\, u_i u_i^T$, where $\{u_i\}_{i=1}^n$ and $\{\lambda_i\}_{i=1}^n$ are the eigenvectors and eigenvalues of $B_m$.¹⁰ A pseudocode of the gradient projection algorithm to solve Eq. (9) is shown in Algorithm 1.

¹⁰Given $A_m \in \mathcal{A}$, $B_m$ is obtained by solving $\min\{\|B_m - A_m\|_F^2 : \mathbf{1}^T B_m \mathbf{1} = 0\}$. Similarly, for a given $B_m \in \mathcal{B}$, $A_{m+1}$ is obtained by solving $\min\{\|A_{m+1} - B_m\|_F^2 : A_{m+1} \succeq 0\}$.

Algorithm 1 Gradient Projection Algorithm
Require: $\{M_{ij}\}_{i,j=1}^n$, $\mathbf{K}$, $\{\tau_{ij}\}_{i,j=1}^n$, $\{n_i^+\}_{i=1}^n$, $\lambda > 0$, $\epsilon > 0$, and step sizes $\{\alpha_t, \beta_t\} > 0$ (see Eq. (9))
1: Set $t = 0$. Choose $\Sigma_0 \in \mathcal{A}$ and $\tilde{h}_0 > 0$.
2: repeat
3: $\quad A_t = \left\{ i : \sum_{j=1}^n \left[ 1 + \tau_{ij}\, \mathrm{tr}(M_{ij} \Sigma_t) - \tau_{ij} \tilde{h}_t \right]_+ + 2 \le n_i^+ \right\} \times \{j : j \in [n]\}$
4: $\quad B_t = \{(i, j) : 1 + \tau_{ij}\, \mathrm{tr}(M_{ij} \Sigma_t) > \tau_{ij} \tilde{h}_t\}$
5: $\quad N_t = B_t \setminus A_t$
6: $\quad \Sigma_{t+1} = P_{\mathcal{N}}\!\left( \Sigma_t - \alpha_t \sum_{(i,j) \in N_t} \tau_{ij} M_{ij} - \alpha_t \lambda \mathbf{K} \right)$
7: $\quad \tilde{h}_{t+1} = \max\!\left( \epsilon,\ \tilde{h}_t + \beta_t \sum_{(i,j) \in N_t} \tau_{ij} \right)$
8: $\quad t = t + 1$
9: until convergence
10: return $\Sigma_t$, $\tilde{h}_t$

Having computed the $\Sigma$ and $\tilde{h}$ that minimize Eq. (9), a test point $x \in \mathcal{X}$ can be classified by using the kernel rule in Eq. (2), where $K_{X_i}(x) = 1_{\{\rho_\varphi(x, X_i) \le h\}}$ with $\rho_\varphi^2(x, X_i) = (k^x - k^{X_i})^T \Sigma (k^x - k^{X_i})$. Therefore, $\Sigma$ and $h$ completely specify the classification rule.
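The projection $P_{\mathcal{N}}$ used in step 6 of Algorithm 1 can be sketched as follows (our code, not the authors'; it implements the two sub-projections of footnote 10 with a fixed iteration budget rather than a convergence test):

```python
import numpy as np

def project_B(S):
    # Projection onto B = {S : 1^T S 1 = 0}:  S - (1^T S 1 / n^2) * 1 1^T
    n = S.shape[0]
    return S - (np.sum(S) / n**2) * np.ones((n, n))

def project_A(S):
    # Projection onto A = {S : S is PSD}: clip negative eigenvalues at zero
    S = (S + S.T) / 2.0
    lam, U = np.linalg.eigh(S)
    return (U * np.clip(lam, 0.0, None)) @ U.T

def project_N(S, n_iter=50):
    """Alternate projections between A and B; since A and B are convex and
    intersect, the iterates approach a point in A ∩ B = {S : S PSD, 1^T S 1 = 0}."""
    for _ in range(n_iter):
        S = project_A(project_B(S))
    return S
```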

5. Experiments & Results

In this section, we compare the performance of our method (referred to as the kernel classification rule, KCR) to several metric learning algorithms on a supervised classification task, in terms of training and test errors. The training phase of KCR involves solving the SDP in Eq. (9) to learn the optimal $\Sigma$ and $h$ from the data, which are then used in Eq. (2) to classify the test data. Note that the SDP in Eq. (9) is obtained by using the naïve kernel for $K$ in Eq. (2); for other smoothing kernels, one has to solve the program in Eq. (8) to learn the optimal $\Sigma$ and $h$. Therefore, the results reported in this section under KCR refer to those obtained with the naïve kernel.
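At test time, the learned $\Sigma$ and $\tilde{h}$ specify the rule completely; a sketch (ours) of this prediction step, with the training kernel matrix recomputed inside the function purely for simplicity:

```python
import numpy as np

def classify_with_learned_metric(x, X, Y, Sigma, h_tilde, kernel):
    """KCR prediction for a test point x using the learned Sigma and h_tilde = h^2:
    each class receives one vote per training point with rho_phi^2(x, X_i) <= h_tilde."""
    k_x = np.array([kernel(x, Xi) for Xi in X])                    # k^x
    K_train = np.array([[kernel(Xi, Xj) for Xj in X] for Xi in X]) # k^{X_i} rows
    L = int(np.max(Y)) + 1
    votes = np.zeros(L)
    for i, Yi in enumerate(Y):
        d = k_x - K_train[i]                                       # k^x - k^{X_i}
        if d @ Sigma @ d <= h_tilde:
            votes[Yi] += 1.0
    return int(np.argmax(votes))
```

In practice one would precompute `K_train` once for the whole test set rather than per query.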

Table 1. k-NN classification accuracy on UCI datasets. The algorithms compared are k-NN (with the Euclidean distance metric), LMNN (large margin NN of Weinberger et al. (2006)), Kernel-NN (see footnote 11), KMLCC (kernel version of metric learning by collapsing classes, Globerson & Roweis (2006)), KLMCA (kernel version of LMNN, Torresani & Lee (2007)), and KCR (the proposed method). The means (µ) and standard deviations (σ) of the train and test (generalization) errors (in %) are reported.

Dataset (n, D, L)        Error   k-NN           LMNN           Kernel-NN      KMLCC          KLMCA          KCR
Balance (625, 4, 3)      Train   17.81 ± 1.86   11.40 ± 2.89   10.73 ± 1.32   10.27 ± 2.01    9.93 ± 1.86   10.47 ± 2.11
                         Test    18.18 ± 1.88   11.49 ± 2.57   17.46 ± 2.13    9.75 ± 1.92   10.54 ± 1.46    8.94 ± 3.12
Ionosphere (351, 34, 2)  Train   15.89 ± 1.43    3.50 ± 1.18    2.84 ± 0.80    7.05 ± 1.31    3.98 ± 1.94    2.73 ± 1.03
                         Test    15.95 ± 3.03   12.14 ± 2.92    5.81 ± 2.25    6.54 ± 2.18    5.19 ± 2.09    5.71 ± 2.60
Iris (150, 4, 3)         Train    4.30 ± 1.55    3.25 ± 1.15    3.60 ± 1.33    3.61 ± 1.59    3.27 ± 1.63    2.29 ± 1.62
                         Test     4.02 ± 2.22    4.11 ± 2.26    4.83 ± 2.47    3.89 ± 1.55    3.74 ± 2.21    3.27 ± 1.87
Wine (178, 13, 3)        Train    5.89 ± 1.35    0.90 ± 2.80    4.95 ± 1.35    4.48 ± 1.21    2.18 ± 2.58    1.01 ± 0.73
                         Test     6.22 ± 2.70    3.41 ± 2.10    7.37 ± 2.82    4.84 ± 2.47    5.17 ± 1.91    2.13 ± 1.24

The algorithms used in the comparative evaluation are:

• The k-NN rule with the Euclidean distance metric.

• The LMNN (large margin nearest neighbor) method proposed by Weinberger et al. (2006), which learns a Mahalanobis distance metric by minimizing the distance between predefined target neighbors while separating them by a large margin from examples with non-matching labels.

• The Kernel-NN rule, which uses the empirical kernel maps¹¹ as training data and performs k-NN classification on this data using the Euclidean distance metric (see the sketch following this list).

• The KMLCC (kernel version of metric learning by collapsing classes) method proposed by Globerson and Roweis (2006), which learns a Mahalanobis distance metric in the kernel space by trying to collapse all examples in the same class to a single point while pushing examples in other classes infinitely far away.

• The KLMCA (kernel version of large margin component analysis) method proposed by Torresani and Lee (2007), which is a non-convex, kernelized version of LMNN.

¹¹Kernel-NN is computed as follows. For each training point $X_i$, the empirical map w.r.t. $\{X_j\}_{j=1}^n$, defined as $k^{X_i} := (\mathcal{K}(X_1, X_i), \ldots, \mathcal{K}(X_n, X_i))^T$, is computed. Then $\{k^{X_i}\}_{i=1}^n$ is taken as the training set for NN classification of the empirical maps of the test data using the Euclidean distance metric.
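A sketch (ours) of the empirical kernel maps used by the Kernel-NN baseline of footnote 11:

```python
import numpy as np

def empirical_kernel_maps(X_train, X_test, kernel):
    """Empirical kernel maps (footnote 11): each point x is represented by
    k^x = (K(X_1, x), ..., K(X_n, x))^T w.r.t. the training set."""
    def kmap(x):
        return np.array([kernel(Xj, x) for Xj in X_train])
    Phi_train = np.array([kmap(x) for x in X_train])   # shape (n, n)
    Phi_test = np.array([kmap(x) for x in X_test])     # shape (m, n)
    return Phi_train, Phi_test

# k-NN with the Euclidean metric is then run on these n-dimensional features.
```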
Four benchmark datasets from the UCI machine learning repository were considered for experimentation. Since the proposed method and KMLCC solve an SDP that scales poorly with $n$, we did not consider large problem sizes.¹² The results shown in Table 1 are the average performance over 20 random splits of the data, with 50% for training, 20% for validation and 30% for testing. The Gaussian kernel, $\mathcal{K}(x, y) = e^{-\upsilon \|x - y\|_2^2}$, was used for the kernel based methods, i.e., Kernel-NN, KMLCC, KLMCA and KCR. The parameters $\upsilon$ and $\lambda$ (only $\upsilon$ for Kernel-NN) were set by cross-validation, searching over $\upsilon \in \{2^i\}_{i=-4}^{4}$ and $\lambda \in \{10^i\}_{i=-3}^{3}$. While testing, KCR uses the rule in Eq. (2), whereas the k-NN rule was used for all the other methods.¹³ It is clear from Table 1 that KCR almost always performs as well as or significantly better than all other methods. However, on the timing front (which we do not report here), KLMCA, which solves a non-convex program over $n \cdot d$ variables, is much faster than KMLCC and KCR, which solve SDPs involving $n^2$ variables. The role of empirical kernel maps is not clear, as there is no consistent relationship between the classification accuracy achieved with k-NN and with Kernel-NN.

¹²To extend KCR to large datasets, one can represent $\varphi$ in terms of $\{c_m\}$, which leads to a non-convex program, as in KLMCA.

¹³Although KCR, LMNN, and KMLCC all solve SDPs to compute the optimal distance metric, KCR has fewer parameters to be tuned than the other methods. LMNN requires cross-validation over $k$ (in k-NN) and the regularization parameter, along with knowledge of the target neighbors. KMLCC requires cross-validation over $k$, the kernel parameter $\upsilon$, and the regularization parameter. In KCR, we only need to cross-validate over $\upsilon$ and $\lambda$. In addition, if $\mathcal{X} = \mathbb{R}^D$ and $\mathcal{K}$ is a linear kernel, then KCR only requires cross-validation over $\lambda$ while computing the optimal Mahalanobis distance metric.

KMLCC, KLMCA, and KCR learn a Mahalanobis distance metric in $\mathbb{R}^n$, which makes it difficult to visualize the class separability achieved by these methods. To visually appreciate their behavior, we generated a synthetic two-dimensional dataset of 3 classes, with each class sampled from a Gaussian distribution with a different mean and covariance. Figure 1(a) shows this dataset, with the three classes shown in different colors. Using this as training data, distance metrics were learned using KMLCC, KLMCA and KCR. If $\Sigma$ is the learned metric, then the two-dimensional projection of $x \in \mathbb{R}^n$ is obtained as $\hat{x} = Lx$, where $L = (\sqrt{\lambda_1} u_1, \sqrt{\lambda_2} u_2)^T$, with $\Sigma = \sum_{i=1}^n \lambda_i u_i u_i^T$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Figure 1(b-d) shows the two-dimensional projections of the training set obtained using KMLCC, KLMCA and KCR.
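The two-dimensional projection used for Figure 1 can be sketched as follows (our code; it assumes the points are already given by their $n$-dimensional representations, i.e., the empirical kernel maps for KMLCC/KLMCA/KCR):

```python
import numpy as np

def two_d_projection(Sigma, Phi):
    """Project points (rows of Phi, their n-dimensional representations) to 2-D
    using the top two eigenpairs of Sigma: x_hat = L x, with
    L = (sqrt(l1) u1, sqrt(l2) u2)^T."""
    lam, U = np.linalg.eigh(Sigma)                 # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:2]                # indices of the two largest
    L = np.sqrt(np.clip(lam[idx], 0, None))[:, None] * U[:, idx].T   # shape (2, n)
    return Phi @ L.T                               # shape (num_points, 2)
```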

[Figure 1: eight scatter-plot panels. Top row (training data): (a) k-NN, error = 6.67; (b) KMLCC, error = 3.33; (c) KLMCA, error = 2.67; (d) KCR, error = 2.67. Bottom row (new sample from the same distribution): (a′) k-NN, error = 10.7; (b′) KMLCC, error = 10.0; (c′) KLMCA, error = 8.0; (d′) KCR, error = 8.0.]

Figure 1. Visualization of k-NN, KMLCC, KLMCA and KCR applied to a two-dimensional synthetic dataset of three classes, with each class modeled as a Gaussian. (a, a′) denote two independent random draws from this distribution, whereas (b-d, b′-d′) represent the two-dimensional projections of these data using the metrics learned by KMLCC, KLMCA and KCR. The points in bold represent the misclassified points. It is interesting to note that KLMCA and KCR generate completely different embeddings but have similar error rates. See §5 for more details.

The projected points were classified using k-NN if $\Sigma$ was obtained from KMLCC/KLMCA, and using Eq. (2) if $\Sigma$ was obtained from KCR. The misclassified points are shown in bold. Since the classification is done on the training points, one would expect a better error rate and better separability between the classes. To understand the generalization performance, a new data sample, shown in Figure 1(a′), was generated from the same distribution as the training set. The learned $\Sigma$ was used to obtain the two-dimensional projections of the new data sample, which are shown in Figure 1(b′-d′). It is interesting to note that KLMCA and KCR generate completely different projections but have similar error rates.

6. Related Work

We briefly review some relevant work and point out similarities and differences with our method. In our work, we have addressed the problem of extending kernel classification rules to arbitrary metric spaces by learning an embedding function that maps the data into a Euclidean space while minimizing an upper bound on the resubstitution estimate of the error probability. The method that is closest in spirit (kernel rules) to ours is the recent work of Weinberger and Tesauro (2007), who learn a Mahalanobis distance metric for kernel regression estimates by minimizing the leave-one-out quadratic regression error of the training set. With the problem being non-convex, they resort to gradient descent techniques. Except for this work, we are not aware of any method related to kernel rules in the context of distance metric learning or learning the bandwidth of the kernel.

There has been a lot of work in the area of distance metric learning for k-NN classification, some of which is briefly discussed in §5. The central idea in all these methods is that similarly labeled examples should cluster together and be far away from differently labeled examples. Shalev-Shwartz et al. (2004) proposed an online algorithm for learning a Mahalanobis distance metric with the constraint that any training example is closer to all the examples that share its label than to any other example with a different label. In addition, examples from different classes are constrained to be separated by a large margin. Though Shalev-Shwartz et al. (2004) do not solve this as a batch optimization problem, it can be shown that it reduces to an SDP (after rank relaxation) that is in fact the same as Eq. (9), except for the outer $[\cdot]_+$ function and the constraint $\mathbf{1}^T \Sigma \mathbf{1} = 0$.

7. Concluding Remarks

In this paper, two questions related to the smoothing kernel based classification rule have been addressed. One is learning the bandwidth of the smoothing kernel, while the other is extending the classification rule to arbitrary domains. We jointly addressed them by learning a function in a reproducing kernel Hilbert space while minimizing an upper bound on the resubstitution estimate of the error probability of the kernel rule.
For a particular choice of the smoothing kernel, called the naïve kernel, we showed that the resulting rule is related to the k-NN rule. Because of this relation, the kernel rule was compared to k-NN and to state-of-the-art distance metric learning algorithms on a supervised classification task and was shown to have comparable performance to these methods. In the future, we would like to develop theoretical guarantees for the proposed method, along with extending it to large-scale applications.

Appendix A. Proof of Theorem 1

We need the following result to prove Theorem 1.

Lemma 5. Let $\mathcal{H} = \{f : \mathcal{X} \to \mathbb{R}\}$ be an RKHS with $\mathcal{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ as its reproducing kernel. Let $\theta : \mathbb{R}^{n^2} \to \mathbb{R}$ be an arbitrary function. Then each minimizer $f \in \mathcal{H}$ of

$$ \theta\!\left( \{f(x_i) - f(x_j)\}_{i,j=1}^n \right) + \lambda \|f\|_{\mathcal{H}}^2 \qquad (10) $$

admits a representation of the form $f = \sum_{i=1}^n c_i \mathcal{K}(\cdot, x_i)$, where $c_i \in \mathbb{R}$ and $\sum_{i=1}^n c_i = 0$.

Proof. The proof follows the generalized representer theorem (Schölkopf et al., 2001, Theorem 4). Since $f \in \mathcal{H}$, $f(x) = \langle f, \mathcal{K}(\cdot, x) \rangle_{\mathcal{H}}$. Therefore, the arguments of $\theta$ in Eq. (10) are of the form $\{\langle f, \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j) \rangle_{\mathcal{H}}\}_{i,j=1}^n$. We decompose $f = f_{\parallel} + f_{\perp}$ so that $f_{\parallel} \in \mathrm{span}\!\left( \{\mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j)\}_{i,j=1}^n \right)$ and $\langle f_{\perp}, \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j) \rangle_{\mathcal{H}} = 0$, $\forall\, i, j \in [n]$. So $f = \sum_{i,j=1}^n \alpha_{ij} \left( \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j) \right) + f_{\perp}$, where $\{\alpha_{ij}\}_{i,j=1}^n \subset \mathbb{R}$. Therefore, $f(x_i) - f(x_j) = \langle f, \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j) \rangle_{\mathcal{H}} = \langle f_{\parallel}, \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j) \rangle_{\mathcal{H}} = \sum_{p,m=1}^n \alpha_{pm} \left( \mathcal{K}(x_i, x_p) - \mathcal{K}(x_j, x_p) - \mathcal{K}(x_i, x_m) + \mathcal{K}(x_j, x_m) \right)$. Now consider the penalty functional $\langle f, f \rangle_{\mathcal{H}}$. For all $f_{\perp}$, $\langle f, f \rangle_{\mathcal{H}} = \|f_{\parallel}\|_{\mathcal{H}}^2 + \|f_{\perp}\|_{\mathcal{H}}^2 \ge \left\| \sum_{i,j=1}^n \alpha_{ij} \left( \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j) \right) \right\|_{\mathcal{H}}^2$. Thus, for any fixed $\alpha_{ij} \in \mathbb{R}$, Eq. (10) is minimized with $f_{\perp} = 0$.
Therefore, the minimizer of Eq. (10) has the form $f = \sum_{i,j=1}^n \alpha_{ij} \left( \mathcal{K}(\cdot, x_i) - \mathcal{K}(\cdot, x_j) \right)$, which is parameterized by the $n^2$ parameters $\{\alpha_{ij}\}_{i,j=1}^n$. By simple algebra, $f$ reduces to $f = \sum_{i=1}^n c_i \mathcal{K}(\cdot, x_i)$, where $c_i = \sum_{j=1}^n (\alpha_{ij} - \alpha_{ji})$ satisfies $\sum_{i=1}^n c_i = 0$.
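This final reduction can be checked numerically; a small sketch (ours) comparing the two expansions of $f$ at the sample points, on a random positive definite Gram matrix:

```python
import numpy as np

# Check: f = sum_{i,j} alpha_ij (K(., x_i) - K(., x_j)) equals sum_i c_i K(., x_i)
# with c_i = sum_j (alpha_ij - alpha_ji), and the c_i sum to zero.
rng = np.random.default_rng(0)
n = 6
alpha = rng.normal(size=(n, n))
G = rng.normal(size=(n, n))
K = G @ G.T                                   # a positive definite Gram matrix
c = alpha.sum(axis=1) - alpha.sum(axis=0)     # c_i = sum_j (alpha_ij - alpha_ji)

# f evaluated at x_1, ..., x_n, both ways
f_expanded = np.array([np.sum(alpha * np.subtract.outer(K[k], K[k])) for k in range(n)])
f_reduced = K @ c                             # f(x_k) = sum_i c_i K(x_k, x_i)

assert np.allclose(f_expanded, f_reduced)
assert abs(c.sum()) < 1e-12
```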

We are now ready to prove Theorem 1.

Proof of Theorem 1. The arguments of $\theta_h$ in Eq. (6) are of the form $\|\varphi(X_i) - \varphi(X_j)\|_2$. Consider $\|\varphi(X_i) - \varphi(X_j)\|_2^2 = \sum_{m=1}^d \left( \varphi_m(X_i) - \varphi_m(X_j) \right)^2 = \sum_{m=1}^d \left( \langle \varphi_m, \mathcal{K}_m(\cdot, X_i) - \mathcal{K}_m(\cdot, X_j) \rangle_{\mathcal{H}_m} \right)^2$. The penalizer in Eq. (6) reduces to $\|\varphi\|_{\mathcal{C}}^2 = \sum_{m=1}^d \|\varphi_m\|_{\mathcal{H}_m}^2$. Therefore, applying Lemma 5 to each $\varphi_m$, $m \in [d]$, proves the result.

Acknowledgments

We thank the reviewers for their comments, which greatly improved the paper. We wish to acknowledge support from the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program.

References

Devroye, L., Györfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag.

Devroye, L., & Krzyżak, A. (1989). An equivalence theorem for L1 convergence of the kernel regression estimate. Journal of Statistical Planning and Inference, 23, 71–82.

Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13, 1–50.

Globerson, A., & Roweis, S. (2006). Metric learning by collapsing classes. Advances in Neural Information Processing Systems 18 (pp. 451–458). Cambridge, MA: MIT Press.

Goldberger, J., Roweis, S., Hinton, G., & Salakhutdinov, R. (2005). Neighbourhood components analysis. Advances in Neural Information Processing Systems 17 (pp. 513–520). Cambridge, MA: MIT Press.

Horst, R., & Thoai, N. V. (1999). D.c. programming: Overview. Journal of Optimization Theory and Applications, 103, 1–43.

Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. Proc. of the Annual Conference on Computational Learning Theory (pp. 416–426).

Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels. Cambridge, MA: MIT Press.

Shalev-Shwartz, S., Singer, Y., & Ng, A. Y. (2004). Online and batch learning of pseudo-metrics. Proc. of the 21st International Conference on Machine Learning. Banff, Canada.

Snapp, R. R., & Venkatesh, S. S. (1998). Asymptotic expansions of the k nearest neighbor risk. The Annals of Statistics, 26, 850–878.

Torresani, L., & Lee, K. (2007). Large margin component analysis. Advances in Neural Information Processing Systems 19 (pp. 1385–1392). Cambridge, MA: MIT Press.

von Luxburg, U., & Bousquet, O. (2004). Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5, 669–695.

Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. Advances in Neural Information Processing Systems 18 (pp. 1473–1480). Cambridge, MA: MIT Press.

Weinberger, K. Q., & Tesauro, G. (2007). Metric learning for kernel regression. Proc. of the 11th International Conference on Artificial Intelligence and Statistics.

Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. (2003). Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems 15 (pp. 505–512). Cambridge, MA: MIT Press.