New Probabilistic Bounds on Eigenvalues and Eigenvectors of Random Kernel Matrices
Nima Reyhani, Aalto University, School of Science, Helsinki, Finland, nima.reyhani@aalto.fi
Hideitsu Hino, Waseda University, Tokyo, Japan, [email protected]
Ricardo Vigário, Aalto University, School of Science, Helsinki, Finland, ricardo.vigario@aalto.fi

Abstract

Kernel methods are successful approaches for different machine learning problems. This success is mainly rooted in using feature maps and kernel matrices. Some methods rely on the eigenvalues/eigenvectors of the kernel matrix, while for other methods the spectral information can be used to estimate the excess risk. An important question remains on how close the sample eigenvalues/eigenvectors are to the population values. In this paper, we improve earlier results on concentration bounds for eigenvalues of general kernel matrices. For distance and inner-product kernel functions, e.g. radial basis functions, we provide new concentration bounds, which are characterized by the eigenvalues of the sample covariance matrix. Meanwhile, the obstacles to sharper bounds are accounted for and partially addressed. As a case study, we derive a concentration inequality for sample kernel target-alignment.

1 INTRODUCTION

Kernel methods, such as Spectral Clustering, Kernel Principal Component Analysis (KPCA), and Support Vector Machines, are successful approaches to many practical machine learning and data analysis problems (Steinwart & Christmann, 2008). The main ingredient of these methods is the kernel matrix, which is built by evaluating the kernel function at the given sample points. The kernel matrix is a finite-sample estimation of the kernel integral operator, which is determined by the kernel function. A number of algorithms rely on the eigenvalues and eigenvectors of the sample kernel operator/kernel matrix and employ this information for further analysis; KPCA (Zwald & Blanchard, 2006) and kernel target-alignment (Cristianini et al., 2002) and (Jia & Liao, 2009) are good examples of such a procedure. Also, the Rademacher complexity, and therefore the excess loss of large-margin classifiers, can be computed using the eigenvalues of the kernel operator; see (Mendelson & Pajor, 2005) and references therein. Thus, it is important to know how reliably the spectrum of the kernel operator can be estimated from a finite-sample kernel matrix.

The importance of the kernel and its spectral decomposition has initiated a series of studies on the concentration of the eigenvalues and eigenvectors of the kernel matrix around their expected values. Under certain rates of spectral decay of the kernel integral operator, (Koltchinskii, 1998) and (Koltchinskii & Giné, 2000) showed that the difference between population and sample eigenvalues and eigenvectors of a kernel matrix is asymptotically normal. A non-asymptotic analysis of the kernel spectrum was first presented in (Shawe-Taylor et al., 2005) and recently revisited in (Jia & Liao, 2009), where an exponential bound for the concentration of the sample eigenvalues is derived using the bounded difference inequality. Also, using concentration results for isoperimetric random vectors, (Mendelson & Pajor, 2005) derived a concentration bound for the supremum of the eigenvalues. That bound is determined by the Orlicz norm of the kernel function with respect to the data distribution, the number of dimensions, and the number of samples.

In this paper, we establish sharper exponential bounds for the concentration of sample eigenvalues of general kernel matrices, which depend on the minimum distance between a given eigenvalue and the rest of the spectrum. Separately, for Euclidean-distance and inner-product kernels, we provide a set of different bounds, which connect the concentration of eigenvalues to the maximum spectral gap of the sample covariance matrix. Similar results are derived for the eigenvectors of the same kernel matrix. Some experiments are also designed to empirically test the presented results. As a case study, we derive concentration bounds for kernel target-alignment, which can be used to measure the agreement between a kernel matrix and the given labels.

The paper is organized as follows. Section 2 summarizes previous results on concentration bounds for eigenvalues. In Section 3, we present new concentration bounds for eigenvalues and eigenvectors of kernel matrices. Section 4 provides a concentration inequality for sample kernel alignment as a case study.

2 PREVIOUS RESULTS

In this section, we briefly present some of the most relevant earlier results on the concentration of the kernel matrix eigenvalues. The main assumption followed throughout the paper is summarized as follows.

Assumption 1. Let $S = \{x_1, \ldots, x_n\}$, $x_i \in \mathcal{X} \subseteq \mathbb{R}^p$, be a set of samples of size $n$, independently drawn from a distribution $P$. Let us define $K_n \in \mathbb{R}^{n \times n}$ to be a kernel matrix on $S$, with $[K_n]_{i,j} = \frac{1}{n} k(x_i, x_j)$, for a Mercer kernel function $k$.

The kernel function $k$ defines a kernel integral operator by
$$T f(\cdot) = \int k(x, \cdot) f(x) \, dP(x),$$
for all smooth functions $f$. The eigenvalue equation is defined by $T u(\cdot) = \lambda u(\cdot)$, where $\lambda$ is an eigenvalue of $T$ and $u : \mathbb{R}^p \to \mathbb{R}$ is the corresponding eigenfunction. For the data sample $S$, the kernel operator can be estimated by the kernel matrix $K_n$, and the eigenvalue problem then reduces to
$$K_n u = \lambda u.$$
The solutions of the above equation are denoted by $\lambda_i(K_n)$ and $u_i(K_n)$, $i = 1, \ldots, n$. For simplicity, we assume that the eigenvalues of the kernel operator/matrix are all distinct and indexed in decreasing order, i.e. $\lambda_1 > \lambda_2 > \cdots$. The focus of this paper is mainly on finding bounds for concentrations of the type $P\{|\frac{1}{n}\lambda_i(K_n) - \frac{1}{n}\mathbb{E}_S\lambda_i(K_n)| \le \epsilon\}$ and, similarly, $P\{\|\frac{1}{n}u_i(K_n) - \frac{1}{n}\mathbb{E}_S u_i(K_n)\| \le \epsilon\}$, for all $i = 1, \ldots, n$. We denote the finite-dimensional unit sphere in $\mathbb{R}^p$ by $S^{p-1}$, i.e. $S^{p-1} := \{w : w \in \mathbb{R}^p, \|w\|_2 = 1\}$, and the Lipschitz norm of a function $f$ by $|f|_L$.
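To make the finite-sample estimation concrete, the following minimal sketch builds the $\frac{1}{n}$-scaled kernel matrix of Assumption 1 and computes its spectral decomposition numerically; the Gaussian kernel, the bandwidth parameter sigma, and the synthetic data are illustrative assumptions of this sketch, not choices prescribed by the paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Return K_n with [K_n]_{ij} = (1/n) k(x_i, x_j) for the Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    n = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2)) / n

# The sample eigenvalues/eigenvectors estimate the spectrum of the kernel
# integral operator T f = int k(x, .) f(x) dP(x).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                   # n = 100 samples in R^5
K_n = gaussian_kernel_matrix(X, sigma=2.0)
eigvals, eigvecs = np.linalg.eigh(K_n)              # solves K_n u = lambda u
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # decreasing order, lambda_1 > lambda_2 > ...
print(eigvals[:5])
```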
The following theorem is presented in (Shawe-Taylor et al., 2005) and provides a uniform bound that depends only on the supremum of the diagonal elements of the kernel matrix. The theorem requires the kernel to be bounded.

Theorem 2.1 ((Shawe-Taylor et al., 2005)). Let us take Assumption 1. Then, for all $\epsilon > 0$, we have
$$P\left\{\left|\tfrac{1}{n}\lambda_i(K_n) - \tfrac{1}{n}\mathbb{E}_S\lambda_i(K_n)\right| > \epsilon\right\} \le 2\exp\left(-\frac{2\epsilon^2 n}{R^4}\right),$$
where $R^2 = \max_{x \in \mathbb{R}^p} k(x, x)$.

The bound provided by Theorem 2.1 is suboptimal, as it depends on the kernel function $k$ only through $R$, and it is uniform over all eigenvalues. To address this suboptimality, (Jia & Liao, 2009) proposed to bound the concentration of the kernel matrix eigenvalues using the largest eigenvalue of the sample kernel and a quantity $\theta \in (0, 1)$:

Theorem 2.2 ((Jia & Liao, 2009)). Let us take Assumption 1. Then, for every $\epsilon > 0$, there exists $0 < \theta \le 1$ such that
$$P\left\{\left|\tfrac{1}{n}\lambda_i(K_n) - \tfrac{1}{n}\mathbb{E}_S\lambda_i(K_n)\right| > \epsilon\right\} \le 2\exp\left(-\frac{2\epsilon^2}{\theta^2\lambda_1^2(K_n)}\right).$$

The advantage of this result over the uniform bound of Theorem 2.1 is that the concentration bound depends solely on the parameter $\theta$ and the largest sample eigenvalue of the kernel matrix. The parameter $\theta$, in principle, varies as a function of the order of the eigenvalue. However, Theorem 2.2 does not provide any hint on how to estimate an optimal value of $\theta$ for a specific eigenvalue. In the following section, we provide sharper concentration inequalities for the spectral decomposition of the kernel matrix.

3 RESULTS

The main result is presented in Theorem 3.1, which establishes a bound on the concentration of the eigenvalues of the kernel matrix for any general kernel function that fulfills Mercer's theorem. In the subsequent subsections, we provide concentration bounds for the spectral decomposition of distance and inner-product kernels.

Our first result is inspired by the following observation: we built kernel matrices, using Gaussian kernels, from 100 samples drawn from a 5-dimensional multivariate Gaussian distribution, and repeated this experiment 1000 times. Boxplots of the first 15 eigenvalues of the Gaussian kernel matrix are depicted in Figure 1. From the boxplots, we can see that the distance between a box's corners and the location of its median decreases as the gap between the corresponding eigenvalue and the rest of the spectrum of the kernel matrix decreases. This suggests that the concentration of an eigenvalue might be controlled by the distance of that eigenvalue to the rest of the spectrum. None of the earlier results, i.e. Theorems 2.1 and 2.2, provide such a characterization. Theorem 3.1 establishes a concentration bound that illustrates the aforementioned phenomenon.

Figure 1: Boxplots of the magnitudes of the first 15 eigenvalues of the Gaussian kernel matrix over the repeated draws.
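The observation behind Figure 1 can be reproduced along the following lines; since the kernel bandwidth used for the figure is not stated in this section, the value sigma = 2.0 below (and the use of a standard-normal sample) is an assumption of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, p, reps, top, sigma = 100, 5, 1000, 15, 2.0
spectra = np.empty((reps, top))
for r in range(reps):
    X = rng.standard_normal((n, p))                   # 100 samples from a 5-dim Gaussian
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K_n = np.exp(-sq / (2.0 * sigma ** 2)) / n        # (1/n)-scaled Gaussian kernel matrix
    spectra[r] = np.linalg.eigvalsh(K_n)[::-1][:top]  # first 15 eigenvalues, decreasing

plt.boxplot(spectra)   # one box per eigenvalue index; the spread shrinks with the spectral gap
plt.xlabel("eigenvalue index")
plt.ylabel("magnitude")
plt.show()
```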
The proof of Theorem 3.1 relies on the bounded difference inequality, which states the following: for $c_i \ge 0$, $1 \le i \le n$, suppose that
$$\sup_{x_1, \ldots, x_n, x_i'} |g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)| \le c_i$$
or
$$\sup_{x_1, \ldots, x_n} |g(x_1, \ldots, x_n) - g_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)| \le c_i,$$
where $x_1', \ldots, x_n'$ are independent copies of $x_1, \ldots, x_n$ and $g_i : A^{n-1} \to \mathbb{R}$, $i = 1, \ldots, n$. Then, for all $\epsilon > 0$, we have
$$P\{|g(x_1, \ldots, x_n) - \mathbb{E}\, g(x_1, \ldots, x_n)| > \epsilon\} \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\right).$$

Proof of Theorem 3.1. Let us define the submatrix $K_n^n \in \mathbb{R}^{(n-1) \times (n-1)}$ with $[K_n^n]_{i,j} = \frac{1}{n} k(x_i, x_j)$, $i, j = 1, \ldots, n-1$, i.e. the principal submatrix of $K_n$ obtained by removing the row and column corresponding to $x_n$. By Cauchy's interlacing lemma, we have $\lambda_i(K_n) - \lambda_i(K_n^n) \le \lambda_i(K_n) - \lambda_{i+1}(K_n) = \lambda_{i,i+1}(K_n)$.
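As a numerical sanity check of the interlacing step, the sketch below compares the spectrum of a kernel matrix with that of its principal submatrix obtained by dropping the last sample; the Gaussian kernel, bandwidth, and data are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 5, 2.0
X = rng.standard_normal((n, p))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_n = np.exp(-sq / (2.0 * sigma ** 2)) / n   # full (1/n)-scaled kernel matrix
K_sub = K_n[:-1, :-1]                        # K_n^n: principal submatrix without x_n

lam = np.linalg.eigvalsh(K_n)[::-1]          # lambda_1(K_n) >= ... >= lambda_n(K_n)
lam_sub = np.linalg.eigvalsh(K_sub)[::-1]    # lambda_1(K_n^n) >= ... >= lambda_{n-1}(K_n^n)

# Cauchy interlacing: lambda_{i+1}(K_n) <= lambda_i(K_n^n) <= lambda_i(K_n), hence
# lambda_i(K_n) - lambda_i(K_n^n) <= lambda_i(K_n) - lambda_{i+1}(K_n).
tol = 1e-12
assert np.all(lam[1:] <= lam_sub + tol)
assert np.all(lam_sub <= lam[:-1] + tol)
print(np.max((lam[:-1] - lam_sub) - (lam[:-1] - lam[1:])))  # <= 0 up to round-off
```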