Metric Embedding for Kernel Classification Rules
Bharath K. Sriperumbudur  [email protected]
Omer A. Lang  [email protected]
Gert R. G. Lanckriet  [email protected]
Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093 USA.

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

In this paper, we consider a smoothing kernel based classification rule and propose an algorithm for optimizing the performance of the rule by learning the bandwidth of the smoothing kernel along with a data-dependent distance metric. The data-dependent distance metric is obtained by learning a function that embeds an arbitrary metric space into a Euclidean space while minimizing an upper bound on the resubstitution estimate of the error probability of the kernel classification rule. By restricting this embedding function to a reproducing kernel Hilbert space, we reduce the problem to solving a semidefinite program and show the resulting kernel classification rule to be a variation of the k-nearest neighbor rule. We compare the performance of the kernel rule (using the learned data-dependent distance metric) to state-of-the-art distance metric learning algorithms (designed for k-nearest neighbor classification) on some benchmark datasets. The results show that the proposed rule achieves classification accuracy as good as or better than that of the other metric learning algorithms.

1. Introduction

Parzen window methods, also called smoothing kernel rules, are widely used in nonparametric density estimation and function estimation, and are popularly known as kernel density and kernel regression estimates, respectively. In this paper, we consider these rules for classification. To this end, let us consider the binary classification problem of classifying $x \in \mathbb{R}^D$, given an i.i.d. training sample $\{(X_i, Y_i)\}_{i=1}^n$ drawn from some unknown distribution $\mathcal{D}$, where $X_i \in \mathbb{R}^D$ and $Y_i \in \{0,1\}$, $\forall\, i \in [n] := \{1, \ldots, n\}$. The kernel classification rule (Devroye et al., 1996, Chapter 10) is given by

$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n \mathbf{1}_{\{Y_i=0\}}\, K\!\left(\frac{x - X_i}{h}\right) \ge \sum_{i=1}^n \mathbf{1}_{\{Y_i=1\}}\, K\!\left(\frac{x - X_i}{h}\right), \\ 1 & \text{otherwise}, \end{cases} \qquad (1)$$

where $K : \mathbb{R}^D \to \mathbb{R}$ is a kernel function, which is usually nonnegative and monotone decreasing along rays starting from the origin. The number $h > 0$ is called the smoothing factor, or bandwidth, of the kernel function, and provides some form of distance weighting. We warn the reader not to confuse the kernel function, $K$, with the reproducing kernel (Schölkopf & Smola, 2002) of a reproducing kernel Hilbert space (RKHS), which we will denote by $\mathcal{K}$.¹ When $K(x) = \mathbf{1}_{\{\|x\|_2 \le 1\}}(x)$ (sometimes called the naïve kernel), the rule is similar to the k-nearest neighbor (k-NN) rule, except that $k$ is different for each $X_i$ in the training set. The k-NN rule classifies each unlabeled example by the majority label among its k nearest neighbors in the training set, whereas the kernel rule with the naïve kernel classifies each unlabeled example by the majority label among its neighbors that lie within a radius of $h$. Devroye and Krzyżak (1989) proved that for regular kernels (see Devroye et al., 1996, Definition 10.1), if the smoothing parameter satisfies $h \to 0$ and $nh^D \to \infty$ as $n \to \infty$, then the kernel classification rule is universally consistent. But for a particular $n$, asymptotic results provide little guidance in the selection of $h$, and selecting the wrong value of $h$ may lead to very poor error rates. In fact, the crux of every nonparametric estimation problem is the choice of an appropriate smoothing factor. This is one of the questions that we address in this paper, by proposing an algorithm to learn an optimal $h$.

¹ Unlike $\mathcal{K}$, $K$ is not required to be a positive definite function. If $K$ is a positive definite function, then it corresponds to a translation-invariant kernel of some RKHS defined on $\mathbb{R}^D$. In such a case, the classification rule in Eq. (1) is similar to the ones that appear in the kernel machines literature.
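As a minimal illustration of the rule in Eq. (1), the sketch below accumulates the two weighted votes and returns the majority class, assuming a Gaussian kernel $K(x) = e^{-\|x\|_2^2}$ and a user-chosen bandwidth $h$; the function names and toy data are illustrative and not taken from the paper.

```python
import numpy as np

def gaussian_kernel(u):
    # K(u) = exp(-||u||_2^2), evaluated row-wise for a batch of difference vectors
    return np.exp(-np.sum(u ** 2, axis=1))

def kernel_rule_binary(x, X, Y, h, kernel=gaussian_kernel):
    """Binary kernel classification rule of Eq. (1).

    x : (D,) query point
    X : (n, D) training inputs
    Y : (n,) labels in {0, 1}
    h : bandwidth (smoothing factor), h > 0
    """
    weights = kernel((x - X) / h)      # K((x - X_i) / h) for all i
    vote0 = weights[Y == 0].sum()      # weighted evidence for class 0
    vote1 = weights[Y == 1].sum()      # weighted evidence for class 1
    return 0 if vote0 >= vote1 else 1  # ties go to class 0, as in Eq. (1)

# Toy usage: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
Y = np.array([0] * 20 + [1] * 20)
print(kernel_rule_binary(np.array([2.5, 2.5]), X, Y, h=1.0))
```

Replacing `gaussian_kernel` by an indicator of $\|u\|_2 \le 1$ recovers the naïve-kernel behavior described above, where the vote is taken over all training points within radius $h$ of the query.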
The second question that we address is learning an optimal distance metric. For $x \in \mathbb{R}^D$, $K$ is usually a function of $\|x\|_2$. Some popular kernels include the Gaussian kernel, $K(x) = e^{-\|x\|_2^2}$; the Cauchy kernel, $K(x) = 1/(1 + \|x\|_2^{D+1})$; and the Epanechnikov kernel, $K(x) = (1 - \|x\|_2^2)\,\mathbf{1}_{\{\|x\|_2 \le 1\}}$. Snapp and Venkatesh (1998) have shown that the finite-sample risk of the k-NN rule may be reduced, for large values of $n$, by using a weighted Euclidean metric, even though the infinite-sample risk is independent of the metric used. This has been experimentally confirmed by Xing et al. (2003), Shalev-Shwartz et al. (2004), Goldberger et al. (2005), Globerson and Roweis (2006), and Weinberger et al. (2006). They all assume the metric to be $\rho(x, y) = \sqrt{(x - y)^T \Sigma (x - y)} = \|L(x - y)\|_2$ for $x, y \in \mathbb{R}^D$, where $\Sigma = L^T L$ is the weighting matrix, and optimize over $\Sigma$ to improve the performance of the k-NN rule. Since the kernel rule is similar to the k-NN rule, one would expect that its performance can be improved by making $K$ a function of $\|Lx\|_2$. Another way to interpret this is to find a linear transformation $L \in \mathbb{R}^{d \times D}$ so that the transformed data lie in a Euclidean metric space.
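The weighted Euclidean metric $\rho(x, y) = \|L(x - y)\|_2$ can be sketched as follows; the transformation matrix `L` here is random, purely to show how a kernel profile becomes a function of the transformed distance rather than of $\|x - y\|_2$ — it is not a learned metric.

```python
import numpy as np

def rho_L(x, y, L):
    # Weighted Euclidean metric:
    # rho_L(x, y) = ||L(x - y)||_2 = sqrt((x - y)^T Sigma (x - y)), with Sigma = L^T L
    return np.linalg.norm(L @ (x - y))

def gaussian_chi(t):
    # Gaussian profile chi(t) = exp(-t^2), so the kernel weight is chi(rho / h)
    return np.exp(-t ** 2)

def naive_chi(t):
    # Naive profile: weight 1 inside radius h, 0 outside
    return 1.0 if t <= 1.0 else 0.0

# Illustration: distances before and after an arbitrary linear transformation
rng = np.random.default_rng(1)
L = rng.normal(size=(2, 5))                 # maps R^D -> R^d with D = 5, d = 2
x, y = rng.normal(size=5), rng.normal(size=5)
print(np.linalg.norm(x - y))                # plain Euclidean distance
print(rho_L(x, y, L))                       # distance in the transformed space
print(gaussian_chi(rho_L(x, y, L) / 0.5))   # kernel weight with bandwidth h = 0.5
```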
Some applications call for natural distance measures that reflect the underlying structure of the data at hand. For example, when computing the distance between two images, tangent distance would be more appropriate than the Euclidean distance. Similarly, when computing the distance between points that lie on a low-dimensional manifold in $\mathbb{R}^D$, geodesic distance is a more natural distance measure than the Euclidean distance. Most of the time, since the true distance metric is either unknown or difficult to compute, the Euclidean or a weighted Euclidean distance is used as a surrogate. In the absence of prior knowledge, the data may be used to select a suitable metric, which can lead to better classification performance. In addition, instead of $x \in \mathbb{R}^D$, suppose $x \in (\mathcal{X}, \rho)$, where $\mathcal{X}$ is a metric space with $\rho$ as its metric. One would like to extend the kernel classification rule to such $\mathcal{X}$. In this paper, we address these issues by learning a transformation that embeds the data from $\mathcal{X}$ into a Euclidean metric space while improving the performance of the kernel classification rule.

The rest of the paper is organized as follows. In §2, we formulate the multi-class kernel classification rule and propose learning a transformation, $\varphi$ (which embeds the training data into a Euclidean space), and the bandwidth parameter, $h$, by minimizing an upper bound on the resubstitution estimate of the error probability. To achieve this, in §3, we restrict $\varphi$ to an RKHS and derive a representation for it by invoking the generalized representer theorem.

2. Problem Formulation

Let $\{(X_i, Y_i)\}_{i=1}^n$ denote an i.i.d. training set drawn from some unknown distribution $\mathcal{D}$, where $X_i \in (\mathcal{X}, \rho)$ and $Y_i \in [L]$, with $L$ being the number of classes. The multi-class kernel classification rule is given by

$$g_n(x) = \arg\max_{l \in [L]} \sum_{i=1}^n \mathbf{1}_{\{Y_i = l\}}\, K_{X_i,h}(x), \qquad (2)$$

where $K_{x_0,h} : \mathcal{X} \to \mathbb{R}_+$ and $K_{x_0,h}(x) = \chi\!\left(\frac{\rho(x, x_0)}{h}\right)$ for some nonnegative function, $\chi$, with $\chi(0) = 1$. The probability of error associated with the above rule is $L(g_n) := \Pr_{(X,Y) \sim \mathcal{D}}(g_n(X) \ne Y)$, where $Y$ is the true label associated with $X$. Since $\mathcal{D}$ is unknown, $L(g_n)$ cannot be computed directly but can only be estimated from the training set. The resubstitution estimate, $\hat{L}(g_n)$, which counts the number of errors committed on the training set by the classification rule, is given by $\hat{L}(g_n) := \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{g_n(X_i) \ne Y_i\}}$.

As aforementioned, when $\mathcal{X} = \mathbb{R}^D$, $\rho$ is usually chosen to be $\|\cdot\|_2$. Previous works in distance metric learning learn a linear transformation $L : \mathbb{R}^D \to \mathbb{R}^d$, leading to the distance metric $\rho_L(X_i, X_j) := \|LX_i - LX_j\|_2 = \sqrt{(X_i - X_j)^T \Sigma (X_i - X_j)}$, where $\Sigma$ captures the variance-covariance structure of the data. In this work, our goal is to jointly learn $h$ and a measurable function, $\varphi \in \mathcal{C} := \{\varphi : \mathcal{X} \to \mathbb{R}^d\}$, so that the resubstitution estimate of the error probability is minimized, with $\rho_\varphi(X_i, X_j) := \|\varphi(X_i) - \varphi(X_j)\|_2$. Once $h$ and $\varphi$ are known, the kernel classification rule is completely specified by Eq. (2).

Devroye et al. (1996, Section 25.6) show that kernel rules of the form in Eq. (1) picked by minimizing $\hat{L}(g_n)$ with smoothing factor $h > 0$ are generally inconsistent if $X$ is nonatomic. The same argument can be extended to the multi-class rule given by Eq. (2). To learn $\varphi$, simply minimizing $\hat{L}(g_n)$ without any smoothness conditions on $\varphi$ can lead to kernel rules that overfit the training set. Such a $\varphi$ can be constructed as follows. Let $n_l$ be the number of points that belong to the $l$-th class, and suppose $n_1 = n_2 = \cdots = n_L$. Then for any $h \ge 1$, choosing $\varphi(X) = Y_i$ when $X = X_i$ and $\varphi(X) = 0$ when $X \notin \{X_i\}_{i=1}^n$ clearly yields zero resubstitution error. However, such a choice of $\varphi$ leads to a kernel rule that always assigns the unseen data
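For concreteness, here is a rough sketch of the multi-class rule in Eq. (2) together with the resubstitution estimate $\hat{L}(g_n)$; the identity embedding and the Gaussian profile $\chi(t) = e^{-t^2}$ used below are placeholders, not the learned $\varphi$ and $h$ that the paper goes on to derive.

```python
import numpy as np

def kernel_rule_multiclass(x, X, Y, h, phi, chi=lambda t: np.exp(-t ** 2)):
    """Multi-class kernel rule of Eq. (2):
    g_n(x) = argmax_l sum_i 1{Y_i = l} chi(rho_phi(x, X_i) / h),
    with rho_phi(x, x') = ||phi(x) - phi(x')||_2."""
    embedded = np.array([phi(xi) for xi in X])            # phi(X_i) for all i
    dists = np.linalg.norm(phi(x) - embedded, axis=1)     # rho_phi(x, X_i)
    weights = chi(dists / h)                               # kernel weights K_{X_i,h}(x)
    labels = np.unique(Y)
    votes = [weights[Y == l].sum() for l in labels]        # per-class weighted vote
    return labels[int(np.argmax(votes))]

def resubstitution_error(X, Y, h, phi):
    # L_hat(g_n) = (1/n) sum_i 1{g_n(X_i) != Y_i}, evaluated on the training set itself
    preds = np.array([kernel_rule_multiclass(xi, X, Y, h, phi) for xi in X])
    return np.mean(preds != Y)

# Toy usage with the identity embedding phi(x) = x and three classes
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1, (15, 3)) for m in (0, 3, 6)])
Y = np.repeat([1, 2, 3], 15)
print(resubstitution_error(X, Y, h=1.0, phi=lambda x: x))
```

Plugging the degenerate $\varphi$ described above into such a routine would drive the resubstitution error to zero on the training points, which is exactly why the paper constrains $\varphi$ (via an RKHS) rather than minimizing $\hat{L}(g_n)$ over all measurable embeddings.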