
New Probabilistic Bounds on Eigenvalues and Eigenvectors of Random Matrices

Nima Reyhani, Aalto University, School of Science, Helsinki, Finland, nima.reyhani@aalto.fi
Hideitsu Hino, Waseda University, Tokyo, Japan, [email protected]
Ricardo Vigário, Aalto University, School of Science, Helsinki, Finland, ricardo.vigario@aalto.fi

Abstract

Kernel methods are successful approaches for different machine learning problems. This success is mainly rooted in using feature maps and kernel matrices. Some methods rely on the eigenvalues/eigenvectors of the kernel matrix, while for other methods the spectral information can be used to estimate the excess risk. An important question remains on how close the sample eigenvalues/eigenvectors are to the population values. In this paper, we improve earlier results on concentration bounds for eigenvalues of general kernel matrices. For distance and inner product kernel functions, e.g. radial basis functions, we provide new concentration bounds, which are characterized by the eigenvalues of the sample covariance matrix. Meanwhile, the obstacles to sharper bounds are accounted for and partially addressed. As a case study, we derive a concentration inequality for the sample kernel target-alignment.

1 INTRODUCTION

Kernel methods such as Spectral Clustering, Kernel Principal Component Analysis (KPCA), and Support Vector Machines are successful approaches in many practical machine learning and data analysis problems (Steinwart & Christmann, 2008). The main ingredient of these methods is the kernel matrix, which is built by evaluating the kernel function at the given sample points. The kernel matrix is a finite sample estimate of the kernel integral operator, which is determined by the kernel function. A number of algorithms rely on the eigenvalues and eigenvectors of the sample kernel operator/kernel matrix and employ this information for further analysis; KPCA (Zwald & Blanchard, 2006) and kernel target-alignment (Cristianini et al., 2002; Jia & Liao, 2009) are good examples of such procedures. Also, the Rademacher complexity, and therefore the excess loss of large margin classifiers, can be computed using the eigenvalues of the kernel operator; see (Mendelson & Pajor, 2005) and references therein. Thus, it is important to know how reliably the spectrum of the kernel operator can be estimated from a finite sample kernel matrix.

The importance of the kernel and its spectral decomposition has initiated a series of studies of the concentration of the eigenvalues and eigenvectors of the kernel matrix around their expected values. Under certain rates of spectral decay of the kernel integral operator, (Koltchinskii, 1998) and (Koltchinskii & Giné, 2000) showed that the difference between population and sample eigenvalues and eigenvectors of a kernel matrix vanishes asymptotically. A non-asymptotic analysis of the kernel spectrum was first presented in (Shawe-Taylor et al., 2005) and recently revisited in (Jia & Liao, 2009), where an exponential bound for the concentration of the sample eigenvalues is derived using the bounded difference inequality. Also, using concentration results for isoperimetric random vectors, (Mendelson & Pajor, 2005) derived a concentration bound for the supremum of the eigenvalue deviations. That bound is determined by the Orlicz norm of the kernel function with respect to the data distribution, the number of dimensions, and the number of samples.

In this paper, we establish sharper exponential bounds for the concentration of sample eigenvalues of general kernel matrices, which depend on the minimum distance between a given eigenvalue and the rest of the spectrum. Separately, for Euclidean-distance and inner-product kernels, we provide a set of different bounds, which connect the concentration of eigenvalues to the maximum spectral gap of the sample covariance matrix. Similar results are derived for the eigenvectors of the same kernel matrix. Some experiments are also designed to empirically test the presented results. As a case study, we derive concentration bounds for kernel target-alignment, which can be used to measure the agreement between a kernel matrix and the given labels.

The paper is organized as follows. Section 2 summarizes the previous results on concentration bounds for eigenvalues. In Section 3, we present new concentration bounds for eigenvalues and eigenvectors of kernel matrices. Section 4 provides a concentration inequality for sample kernel alignment as a case study.

2 PREVIOUS RESULTS

In this section, we briefly present some of the most relevant earlier results on the concentration of the kernel matrix eigenvalues. The main assumption followed throughout the paper is summarized as

Assumption 1. Let S = {x_1, ..., x_n}, x_i ∈ X ⊆ R^p, be a set of samples of size n, independently drawn from distribution P. Let us define K_n ∈ R^{n×n} to be a kernel matrix on S, with [K_n]_{i,j} = (1/n) k(x_i, x_j), for a Mercer kernel function k.

The kernel function k defines a kernel integral operator by

    T f(·) = ∫ k(x, ·) f(x) dP(x),

for all smooth functions f. The eigenvalue equation is defined by T u(·) = λ u(·), where λ is an eigenvalue of T and u : R^p → R is the corresponding eigenfunction. For data samples S, the kernel operator can be estimated by the kernel matrix K_n. Then, the eigenvalue problem reduces to

    K_n u = λ u.

The solutions of the above equation are denoted by λ_i(K_n) and u_i(K_n), i = 1, ..., n. For simplicity, we assume throughout the paper that the eigenvalues of the kernel operator/matrix are not identical and are indexed in decreasing order, i.e. λ_1 > λ_2 > .... The focus of this paper is mainly on finding bounds for concentrations of the type P{ |(1/n)λ_i(K_n) − (1/n)E_S λ_i(K_n)| ≤ ε } and, similarly, P{ ||(1/n)u_i(K_n) − (1/n)E_S u_i(K_n)|| ≤ ε }, for all i = 1, ..., n. We denote the finite dimensional unit sphere in R^p by S^{p−1}, i.e. S^{p−1} := {w : w ∈ R^p, ||w||_2 = 1}, and the Lipschitz norm of a function f by |f|_L.

The following theorem is presented in (Shawe-Taylor et al., 2005) and provides a uniform bound that depends only on the supremum of the diagonal elements of the kernel matrix. This theorem requires that the kernel is bounded.

Theorem 2.1 ((Shawe-Taylor et al., 2005)). Let us take Assumption 1. Then, for all ε > 0, we have

    P{ |(1/n)λ_i(K_n) − (1/n)E_S λ_i(K_n)| > ε } ≤ 2 exp(−2nε² / R⁴),

where R² = max_{x ∈ R^p} k(x, x).

The bound provided by Theorem 2.1 is suboptimal, as it depends on the kernel function k only through R, and it is also uniform over all eigenvalues. To address the suboptimality of Theorem 2.1, (Jia & Liao, 2009) proposed to bound the concentration of the kernel matrix eigenvalues using the largest eigenvalue of the sample kernel matrix and a quantity θ ∈ (0, 1):

Theorem 2.2 ((Jia & Liao, 2009)). Let us take Assumption 1. Then, for every ε > 0, there exists 0 < θ ≤ 1 such that

    P{ |(1/n)λ_i(K_n) − (1/n)E_S λ_i(K_n)| > ε } ≤ 2 exp(−2ε² / (θ² λ_1²(K_n))).

The advantage of this result over the uniform result in Theorem 2.1 is that the concentration bound depends solely on the parameter θ and the largest sample eigenvalue of the kernel matrix. The parameter θ, in principle, varies as a function of the order of the eigenvalue. However, Theorem 2.2 does not provide any hint on estimating an optimal value of the parameter θ for any specific eigenvalue. In the following section, we provide sharper concentration inequalities for the spectral decomposition of the kernel matrix.
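As a concrete illustration of the objects defined above, the following is a minimal sketch (Python with NumPy is assumed here; the paper itself does not prescribe any implementation) that builds the normalized kernel matrix of Assumption 1 for a given Mercer kernel function and solves the finite sample eigenvalue problem K_n u = λu.

```python
import numpy as np

def kernel_matrix(X, k):
    """Normalized kernel matrix of Assumption 1: [K_n]_{ij} = (1/n) k(x_i, x_j)."""
    n = X.shape[0]
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return K / n

def kernel_spectrum(K):
    """Eigenvalues (decreasing) and eigenvectors of the symmetric kernel matrix K_n."""
    lam, U = np.linalg.eigh(K)          # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1]
    return lam[order], U[:, order]

# Illustrative usage with a Gaussian kernel, k(x, y) = exp(-0.5 ||x - y||^2) (an assumption).
X = np.random.randn(100, 5)
K = kernel_matrix(X, lambda x, y: np.exp(-0.5 * np.sum((x - y) ** 2)))
eigvals, eigvecs = kernel_spectrum(K)
```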

3 RESULTS

The main result is presented in Theorem 3.1, which establishes a bound on the concentration of eigenvalues of the kernel matrix for any general kernel function that fulfills Mercer's theorem. In the subsequent subsections, we provide concentration bounds for the spectral decomposition of distance and inner product kernels. Our first result is inspired by the following observation: we built kernel matrices using Gaussian kernels from 100 samples drawn from a 5-dimensional multivariate Gaussian, and repeated this experiment 1000 times. The boxplots of the first 15 eigenvalues of the Gaussian kernel matrix are depicted in Figure 1. From the boxplots, we can see that the distance between any box's corner and the location of its median decreases as the gap between the corresponding eigenvalue and the rest of the spectrum of the kernel matrix decreases. This suggests that the concentration of the eigenvalues might be controlled by the distance of that eigenvalue to the rest of the spectrum. None of the earlier results, i.e. Theorems 2.1 and 2.2, provides such a characterization. Theorem 3.1 establishes a concentration bound which illustrates the aforementioned phenomenon.

Theorem 3.1 (Probabilistic bound on eigenvalues of kernel matrices). Let us take Assumption 1 and define λ_{i,i+1}(K_n) := λ_i(K_n) − λ_{i+1}(K_n). Then, the following inequality holds for the sample eigenvalues λ_i(K_n), i = 1, ..., n − 1:

    P{ |(1/n)λ_i(K_n) − (1/n)E_S λ_i(K_n)| > ε } ≤ exp(−2nε² / λ_{i,i+1}(K_n)²).    (1)

Experimental results to validate the above bound are given in Example 1. To prove Theorem 3.1, we introduce Cauchy's interlacing lemma, as a particular matrix perturbation result, and the bounded difference inequality.

Definition 1 (Principal Submatrix). The principal submatrix of A of order n − 1 is the matrix obtained by taking the first n − 1 rows and columns of matrix A.

Lemma (Cauchy's Interlacing Lemma, (Horn & Johnson, 1985)). Let B be a principal submatrix of the Hermitian matrix A, of order n − 1, with eigenvalues µ_1 ≥ ··· ≥ µ_{n−1}. Also, denote the eigenvalues of A by λ_1 ≥ ··· ≥ λ_n. Then, the following property holds for the eigenvalues of A and B:

    λ_1 ≥ µ_1 ≥ λ_2 ≥ µ_2 ≥ ··· ≥ µ_{n−1} ≥ λ_n.

Theorem (Bounded Difference Inequality, (Ledoux & Talagrand, 1991) and (Shawe-Taylor et al., 2005)). Let x_1, ..., x_n be independent random variables taking values in a set A, and let g : A^n → R satisfy one of the following: for c_i ≥ 0, 1 ≤ i ≤ n,

    sup_{x_1,...,x_n, x_i'} |g(x_1, ..., x_n) − g(x_1, ..., x_{i−1}, x_i', x_{i+1}, ..., x_n)| ≤ c_i

or

    sup_{x_1,...,x_n} |g(x_1, ..., x_n) − g_i(x_1, ..., x_{i−1}, x_{i+1}, ..., x_n)| ≤ c_i,

where x_1', ..., x_n' are independent copies of x_1, ..., x_n and g_i : A^{n−1} → R, i = 1, ..., n. Then, for all ε > 0 we have

    P{ |g(x_1, ..., x_n) − E g(x_1, ..., x_n)| > ε } ≤ 2 exp(−2ε² / Σ_{i=1}^n c_i²).

Proof of Theorem 3.1. Let us define the submatrix K_n^n ∈ R^{(n−1)×(n−1)} with [K_n^n]_{i,j} = k(x_i, x_j), i, j = 1, ..., n − 1. By Cauchy's interlacing lemma, we have λ_i(K_n) − λ_i(K_n^n) ≤ λ_i(K_n) − λ_{i+1}(K_n) = λ_{i,i+1}(K_n). Using the bounded difference inequality, see Appendix 1, and taking Σ_{j=1}^n c_j² = n λ_{i,i+1}(K_n)², we obtain the claimed result.
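The interlacing step of the proof is easy to check numerically; the sketch below (Python with NumPy assumed, purely illustrative) compares the eigenvalues of a sample kernel matrix with those of its principal submatrix of order n − 1 and reports the gaps λ_{i,i+1}(K_n) entering inequality (1).

```python
import numpy as np

def principal_submatrix(A):
    """Principal submatrix of order n-1 (Definition 1): keep the first n-1 rows and columns."""
    return A[:-1, :-1]

def check_interlacing(K):
    """Verify lambda_1 >= mu_1 >= lambda_2 >= ... >= mu_{n-1} >= lambda_n (Cauchy's interlacing lemma)."""
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]
    mu = np.sort(np.linalg.eigvalsh(principal_submatrix(K)))[::-1]
    ok = np.all(lam[:-1] + 1e-12 >= mu) and np.all(mu >= lam[1:] - 1e-12)
    gaps = lam[:-1] - lam[1:]            # lambda_{i,i+1}(K_n) for i = 1, ..., n-1
    return ok, gaps

# Illustrative usage with a Gaussian kernel matrix as in Assumption 1.
X = np.random.randn(100, 5)
sq = np.sum(X**2, axis=1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T)) / X.shape[0]
ok, gaps = check_interlacing(K)
print(ok, gaps[:3])
```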

Figure 1: Boxplots of the 15 first eigenvalues (eigenvalue magnitude versus index) of Gaussian kernel matrices built from 5-dimensional Gaussian samples drawn from N(0, I_5).

Note that λ_{i,i+1}(K_n) can be bounded by θλ_1 for some θ ∈ (0, 1], which reduces to the bound in Theorem 2.2. With extra conditions on the kernel matrix, the bound in Theorem 2.1 can be derived from the above result; for details please see Corollary 2 in (Jia & Liao, 2009).

Corollary 3.1. Under the assumptions of Theorem 3.1, the sums of the top and bottom eigenvalues are concentrated around their means, as follows:

    P{ |(1/n) Σ_{i=1}^k λ_i(K_n) − (1/n) E_S Σ_{i=1}^k λ_i(K_n)| > ε } ≤ exp(−2nε² / λ_{1,k+1}(K_n)²),

    P{ |(1/n) Σ_{i=k}^n λ_i(K_n) − (1/n) E_S Σ_{i=k}^n λ_i(K_n)| > ε } ≤ exp(−2nε² / λ_{k,n}(K_n)²),

where λ_{1,k+1}(K_n) = λ_1(K_n) − λ_{k+1}(K_n) and λ_{k,n}(K_n) = λ_k(K_n) − λ_n(K_n).

The above inequality is useful for deriving a probabilistic bound on the population risk of Kernel PCA using the empirical risk. We leave further details for future work.

3.1 DISTANCE AND INNER PRODUCT KERNELS

In the following, we present results for distance and inner product kernel functions, which are commonly used in practical data analysis, such as Radial Basis Functions in SVMs (Steinwart & Christmann, 2008) or Laplacian kernels in Spectral Clustering (Bengio et al., 2003). The distance kernel matrix and the inner product kernel matrix are defined by k(x_i, x_j) = f(||x_i − x_j||_2²) and k(x_i, x_j) = f(x_i^T x_j), respectively, where f : R → R is a smooth function which has a Bochner integral representation (Steinwart & Christmann, 2008). For the rest of the paper, except Section 4, the spectral norm is denoted by ||·||. The result on the concentration of eigenvalues of the distance kernel matrix is presented in Theorem 3.2.
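Before stating the theorem, the following sketch (Python with NumPy assumed; illustrative only, not part of the paper) shows how the ingredients of the bound below can be computed from data: the covariance spectral gap λ_{1,p}(Σ) = λ_1(Σ) − λ_p(Σ) and an empirical proxy for the bound M on ||y_i||_2 obtained by whitening the sample, following the representation x_i = Σ^{1/2} y_i. The value |f|_L = 0.5 used at the end is the Lipschitz constant of the assumed profile f(t) = exp(−t/2).

```python
import numpy as np

def covariance_gap_and_M(X):
    """Return lambda_1(Sigma) - lambda_p(Sigma) and max_i ||y_i||_2 with y_i = Sigma^{-1/2} x_i."""
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)                 # ascending order
    gap = evals[-1] - evals[0]
    # Whitening: y_i = Sigma^{-1/2} x_i, so that the y_i have (approximately) identity covariance.
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    Y = (X - X.mean(axis=0)) @ inv_sqrt
    M = np.max(np.linalg.norm(Y, axis=1))
    return gap, M

def theorem_3_2_rhs(eps, n, M, f_lip, gap):
    """Right hand side of inequality (2): exp(-n^2 eps^2 / (18 M^4 |f|_L^2 (lambda_1 - lambda_p)^2))."""
    return np.exp(-(n**2) * eps**2 / (18.0 * M**4 * f_lip**2 * gap**2))

# Illustrative usage on a 5-dimensional Gaussian sample.
X = np.random.randn(100, 5)
gap, M = covariance_gap_and_M(X)
print(theorem_3_2_rhs(eps=0.05, n=100, M=M, f_lip=0.5, gap=gap))
```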
Theorem 3.2. Let us take Assumption 1, with k(x_i, x_j) = f(||x_i − x_j||_2²). We assume x_1, ..., x_n have a nonsingular sample covariance matrix Σ. We also assume that, for random variables y_1, ..., y_n with identity covariance matrix and the same distribution as x_1, ..., x_n, we have ||y_i||_2 ≤ M and x_i = Σ^{1/2} y_i for all i = 1, ..., n. The first and last eigenvalues of Σ are denoted by λ_1(Σ) and λ_p(Σ), respectively. Then, for λ_{1,p} = λ_1(Σ) − λ_p(Σ), we have

    P{ |(1/n)λ_i(K_n) − (1/n)E_S λ_i(K_n)| > ε } ≤ exp(−n²ε² / (18 M⁴ |f|_L² λ_{1,p}²)).    (2)

Proof. Let us denote the perturbation of the matrix K_n by K_n^n, defined by [K_n^n]_{i,j} := (1/n) f(||x_i − x_j||_2²) for all i, j = 2, ..., n and [K_n^n]_{1,j} = [K_n^n]_{j,1} := (1/n) f(||x_{1'} − x_j||²) for all j = 1, ..., n, where x_{1'} is an independent copy of x_1. Then, K_n = K_n^n + E. Suppose the perturbation term E has a small spectral norm. Then, by the Wielandt theorem (Horn & Johnson, 1985), we have (1/n)λ_i(K_n) − (1/n)λ_i(K_n^n) ≤ (1/n)||E||. The operator norm of E can be computed as follows:

    ||E|| = ||K_n − K_n^n|| = sup_{z ∈ S^{n−1}} |⟨(K_n − K_n^n) z, z⟩|
          = sup_{z ∈ S^{n−1}} (2/n) | Σ_{i=1}^n z_1 z_i [f(||x_i − x_1||_2²) − f(||x_i − x_{1'}||_2²)] |
          ≤ sup_{z ∈ S^{n−1}} (2/n) |z_1| | Σ_{i=1}^n z_i [f(||x_i − x_1||_2²) − f(||x_i − x_{1'}||_2²)] |
          ≤ sup_{z ∈ S^{n−1}} (2/n) (Σ_{i=1}^n z_i²)^{1/2} (Σ_{i=1}^n [f(||x_i − x_1||_2²) − f(||x_i − x_{1'}||_2²)]²)^{1/2}
          ≤ (2|f|_L / n) (Σ_{i=1}^n | ||x_i − x_1||_2² − ||x_i − x_{1'}||_2² |²)^{1/2},

where we used Hölder's inequality. Now, by the bounded manifold assumption, i.e. ||y_i|| ≤ M, we have

    | ||x_i − x_1||² − ||x_i − x_{1'}||² | = |x_1^T x_1 − x_{1'}^T x_{1'} − 2 x_1^T x_i + 2 x_{1'}^T x_i|
                                          = |y_1^T Σ y_1 + 2 y_{1'}^T Σ y_i − y_{1'}^T Σ y_{1'} − 2 y_1^T Σ y_i|
                                          ≤ 3 M² λ_{1,p}(Σ).

Therefore, the spectral norm of the error matrix can be bounded by

    ||E|| ≤ (6 M² / √n) |f|_L λ_{1,p}(Σ).    (3)

By the bounded difference inequality and eq. (3) we obtain the desired result.

Similar results to Corollary 3.1 can be derived, where the kernel eigenvalues in the exponential bound are replaced by the eigenvalues of the sample covariance matrix, in the same way as they appear in the exponential bound of Theorem 3.2. The concentration bound for the inner product kernel can be derived in a similar way as in Theorem 3.2; a sketch of the derivation is provided in Remark 3.1.

Remark 3.1 (concentration bound for smooth inner product kernels). For the inner product kernel as defined earlier in this section, with an |f|_L-Lipschitz function f, we have

    P{ |(1/n)λ_i(K_n) − (1/n)E_S λ_i(K_n)| > ε } ≤ exp(−n²ε² / (4 |f|_L² M⁴ λ_{1,p}²(Σ))).

The proof is similar to that of Theorem 3.2.

Theorem 3.2 and the result in Remark 3.1 connect the concentration of the eigenvalues of the distance and inner product kernel matrices to the difference between the largest and smallest eigenvalues of the covariance matrix. In high dimensional data, the first eigenvalue of the sample covariance matrix is likely to become rather large, implying that the eigenvalues of the distance kernel matrix are not concentrated around the mean. This result suggests that it may be better not to use smooth distance or inner product kernels with high dimensional data. The results of this section also hold for a wider class of kernel functions that behave almost like the distance or the inner product kernels. The characterization of this wider class of functions is summarized in Remark 3.2.

Remark 3.2. The result of Theorem 3.2 also holds for other kernels that satisfy one of the following conditions:

    1. |k(x_1, x_2) − k(x_1', x_2')| ≤ C_X | ||x_1 − x_1'||_2² − ||x_2 − x_2'||_2² |,

    2. |k(x_1, x_2) − k(x_1', x_2')| ≤ C_X' | x_1^T x_1' − x_2^T x_2' |,

where C_X and C_X' are constants depending on the smoothness of the kernel function and the data distribution.

In order to check the bounds in Theorems 3.1 and 3.2, we run two numerical experiments with Gaussian kernels. The details and results are summarized in Example 1.

Example 1. We draw 100 samples from the standard normal distribution and compute the kernel matrix using a Gaussian kernel function k(x, y) = exp(−0.5 ||x − y||²).
We repeat this procedure 1000 times and compute empirical estimates of the left hand and right hand sides of inequality (1). The results are depicted in Figure 2 (top), where solid, dashed, and dotted lines correspond to the left hand side of eq. (1) for i = 1, 2, 3, respectively. Lines with different marks correspond to the right hand side of inequality (1) for i = 1, 2, 3, respectively. As can be seen, the concentration of the eigenvalues varies with their order. The bound presented in eq. (1) tracks the concentration changes for each eigenvalue separately.

Similarly, we repeat the above example for multivariate Gaussian samples N(0, I_2) and N(0, I_5) to check the bound in Theorem 3.2, where I_p denotes the identity matrix in R^p. The empirical estimates of the left hand side and right hand side of inequality (2) are drawn in Figure 2 (bottom). As the experiments illustrate, the concentration of the eigenvalues changes with the dimension of the samples, which is captured by the bound provided in inequality (2).

Figure 2: Empirical study of the concentration of eigenvalues (probability versus epsilon). The top panel corresponds to inequality (1) for the 1st, 3rd, and 5th eigenvalues (curves LHS.ev1, ..., RHS.ev7). The bottom panel corresponds to inequality (2) for dimensions 2 and 5 (curves LHS.ev1.dim.2, ..., RHS.dim.5). See Example 1 for details.

One of the main obstacles to improving the bounds derived in this or the earlier section is the limitation of the matrix perturbation results. For example, the results in Theorem 3.2 and Theorem 2.1 rely on the first order eigenvalue expansion λ_i(K̃) = λ_i(K) − u_i^T(K) E u_i(K) + O(||K̃ − K||²), where K̃ is a perturbation of the matrix or compact linear operator K and E = K̃ − K. However, these results can be slightly improved by using the second order expansion, as described in (Kato, 1996):

    λ_i(K̃) = λ_i(K) − u_i^T(K) E u_i(K) + Σ_{j=1, j≠i}^n [u_j^T(K) E u_i(K) u_i^T(K) E u_j(K)] / (λ_j − λ_i) + O(||E||³).

The above expansion holds if the spectral norm of the error operator/matrix E is smaller than half of the distance between the eigenvalue λ_i and the rest of the spectrum of K. In the notation of Theorem 3.2, E = K_n − K_n^n is symmetric. Therefore, the perturbation of the eigenvalues of the kernel matrix can be bounded by

    |λ_i(K_n) − λ_i(K_n^n)| ≤ ||E||_2 + ||E||_2² Σ_{j=1, j≠i}^n 1 / |λ_j − λ_i|,    (4)

where λ_{j,i}(K_n) = λ_j(K_n) − λ_i(K_n). Combining the preceding bound with eq. (3) reads

    P{ |(1/n)λ_i(K_n) − (1/n)E_S λ_i(K_n)| > ε } ≤ exp(−n²ε² / γ²),

where

    γ = 6 M² |f|_L λ_{1,p}(Σ)/√n + 36 M⁴ |f|_L² (λ_{1,p}(Σ)²/n) Σ_{j=1, j≠i}^n 1/λ_{j,i}(K_n).

Another direction for improving this result is to refine the computation of the spectral norm ||K_n − K_n^n||, so that the dependencies between the elements of E are taken into account. This is left for future work.
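The following is a minimal simulation sketch of the check described in Example 1, written in Python with NumPy as an assumption (the paper does not specify an implementation). It estimates the left hand side of inequality (1) empirically and, as a simplification, evaluates the right hand side from the spectral gap of a single reference draw of the kernel matrix.

```python
import numpy as np

def gaussian_kernel_matrix(X):
    """[K]_{ij} = (1/n) exp(-0.5 ||x_i - x_j||^2), the normalized Gaussian kernel of Example 1."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * np.clip(d2, 0.0, None)) / n

def empirical_lhs(i, eps, n=100, p=1, trials=1000, rng=None):
    """Monte-Carlo estimate of P{|lambda_i(K_n) - E lambda_i(K_n)| > eps} (decreasing eigenvalue order)."""
    rng = np.random.default_rng(rng)
    lams = np.array([np.sort(np.linalg.eigvalsh(
        gaussian_kernel_matrix(rng.standard_normal((n, p)))))[::-1][i] for _ in range(trials)])
    return np.mean(np.abs(lams - lams.mean()) > eps)

def theorem_3_1_rhs(K, i, eps):
    """Right hand side of inequality (1): exp(-2 n eps^2 / (lambda_i - lambda_{i+1})^2)."""
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]
    gap = lam[i] - lam[i + 1]
    return np.exp(-2.0 * K.shape[0] * eps**2 / gap**2)

rng = np.random.default_rng(0)
K_ref = gaussian_kernel_matrix(rng.standard_normal((100, 1)))
print(empirical_lhs(0, 0.02), theorem_3_1_rhs(K_ref, 0, 0.02))
```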

3.2 EIGENVECTORS OF INNER PRODUCT AND DISTANCE KERNELS

In this section, we provide concentration bounds for the eigenvectors of the distance and inner product kernel matrices. The results mainly rely on the eigenfunction perturbation expansion summarized in the following theorem, which is due to (Kato, 1996).

Theorem 3.3 (eigenfunction expansion). Let K̃ = K + E, and assume ||E|| is smaller than half of the distance between the eigenvalue λ_i and the rest of the eigenvalues. Then

    ũ_i = u_i + Σ_{j=1, j≠i}^n [u_j^T E u_i / (λ_j − λ_i)] u_j + O(||E||²),    (5)

where u_i is the i-th eigenvector of the matrix K, with corresponding eigenvalue λ_i, and ũ_i is the eigenvector of K̃ corresponding to λ_i.

Note that, by eq. (5), when |λ_j − λ_i| is small the term E/(λ_j − λ_i) has a large norm, which results in an instability of the eigenvector u_i under the perturbation. This is the case for all eigenvectors that correspond to small eigenvalues. Such a phenomenon is empirically studied in (Ng et al., 2001). In the following, we provide pointwise and uniform bounds for the concentration of eigenvectors using eq. (5).

The following results are slightly different from those in (Zwald & Blanchard, 2006), as they contain one extra projection onto the given samples. Moreover, the proof presented here also holds for kernels with an infinite dimensional feature space. In the following, we denote the evaluation of a function u(·) : R^p → R on the samples S by u|_S := [u(x_1), ..., u(x_n)].

Lemma 1. Let us take the assumptions made in Theorem 3.2. We consider a real valued function f_w(·) : R^n → R, x ↦ ⟨x, w⟩, w ∈ R^n. Further, assume k(x, y) = f(||x − y||_2²). Then, for every w ∈ S^{n−1} we have the following concentration result:

    P{ |f_w(u_i(K_n)) − E_S f_w(u_i(K_n)|_S)| > ε } ≤ exp(−ε² / (18 M⁴ |f|_L² R_i²(K_n) λ_{1,p}²(Σ))),

where u_i(K_n) is the eigenvector of the kernel matrix K_n corresponding to the i-th eigenvalue, λ_{1,p}(Σ) := λ_1(Σ) − λ_p(Σ), and R_i(K_n) := Σ_{j=1, j≠i}^n 1 / |λ_i(K_n) − λ_j(K_n)|.

Proof. Let us define the perturbed matrix K_n^n by [K_n^n]_{i,j} = (1/n) k(x_i, x_j) for all i, j = 2, ..., n and [K_n^n]_{1,j} = [K_n^n]_{j,1} = (1/n) k(x_{1'}, x_j). Therefore, K_n = K_n^n + E. By the definition of f_w we have

    |f_w(u_i(K_n)) − f_w(u_i(K_n^n))| = |f_w(u_i(K_n) − u_i(K_n^n))| ≤ ||w|| ||u_i(K_n) − u_i(K_n^n)||,

where u_i(K_n) and u_i(K_n^n) are eigenvectors of the matrices K_n and K_n^n. Now, we need to compute the quantity ||u_i(K_n) − u_i(K_n^n)||. Using the perturbation result of Theorem 3.3, we have ||u_i(K_n) − u_i(K_n^n)|| ≤ ||E|| R_i(K_n). Then, for all w ∈ S^{n−1}, applying the bounded difference inequality results in the following concentration bound:

    P{ |⟨w, u_i(K_n) − E_S u_i(K_n)⟩| > ε } ≤ exp(−2ε² / (n ||E||² R_i²(K_n) sup_w ||w||_2²)).    (6)

By plugging eq. (3) into eq. (6) we obtain the claimed result.

The concentration bound in Lemma 1 can be extended to real-valued convex Lipschitz functions. An immediate application of Lemma 1 is to derive a uniform concentration bound for the kernel matrix eigenvectors. This result is provided in Corollary 3.2.

Corollary 3.2 (uniform concentration of kernel eigenvectors). With the same assumptions as in Lemma 1, we have

    P{ ||u_i(K_n) − E_S u_i(K_n)|_S|| > ε } ≤ 2 exp(2n − cε²),

where c^{−1} = 18 M⁴ |f|_L² R_i²(K_n) λ_{1,p}²(Σ).

In the above bound, the positive first term inside the exponential function does not bring any harm, since the first eigenvalue of the sample covariance matrix is of order O(√n), which cancels out the effect of the 2n term inside the exponential function. The proof relies on Lemma 1 and an ε-net argument (Ledoux & Talagrand, 1991).

Proof of Corollary 3.2. Let us denote an ε-net of the compact unit sphere S^{n−1} by N_ε, for some ε ∈ (0, 1). Let x ∈ S^{n−1} be such that ||u|| = ⟨x, u⟩, where u := u_i(K_n). Now, we can choose y ∈ N_ε so that ||x − y|| ≤ ε holds. Then, we have |⟨x, u⟩ − ⟨y, u⟩| ≤ ||u|| ||x − y|| ≤ ε||u||. By the triangle inequality, we have |⟨y, u⟩| ≥ |⟨x, u⟩| − |⟨x, u⟩ − ⟨y, u⟩| ≥ ||u|| − ε||u||. Putting all together, we obtain

    max_{y ∈ N_ε} |⟨y, u⟩| ≤ ||u|| ≤ (1/(1 − ε)) max_{y ∈ N_ε} |⟨u, y⟩|.
By using the upper bound in the preceding inequality, we have

    P{ ||u − E_S u|_S|| > ε } = P{ sup_{w ∈ S^{n−1}} |⟨w, u − E_S u|_S⟩| > ε }
                             ≤ P{ (1/(1 − ε)) max_{w ∈ N_ε} |⟨w, u − E_S u|_S⟩| > ε }
                             ≤ |N_ε| P{ |⟨w, u − E_S u|_S⟩| > ε }
                             ≤ 2 |N_ε| exp(−cε²),

where in the second inequality we have used the union bound. By Lemma 9.5 in (Ledoux & Talagrand, 1991), for ε = 1/2 we have |N_ε| ≤ 6^n. Putting all together, we get the claimed result.

The result on the eigenvectors of the distance kernel can be applied to the broader class of kernels that satisfy Lipschitz smoothness, see Remark 3.2. In particular, for the inner product kernels we have ||E|| ≤ 2 M² |f|_L λ_{1,p}(Σ)/√n. Plugging this into inequality (6), we get the desired bounds.

The concentration bounds presented in Lemma 1 and Corollary 3.2 can also be slightly improved by using a higher order expansion of the operator perturbation. For more details on higher order expansions of eigenvectors, see (Kato, 1996).
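As an illustration of the quantities entering Lemma 1 and Corollary 3.2, the following sketch (Python with NumPy assumed; not part of the paper) computes the eigengap sum R_i(K_n) and evaluates the resulting pointwise eigenvector bound of Lemma 1 for given M, |f|_L, and the covariance gap. The values of M and |f|_L in the usage line are placeholder assumptions.

```python
import numpy as np

def eigengap_sum(K, i):
    """R_i(K_n) = sum_{j != i} 1 / |lambda_i - lambda_j| for the sample kernel matrix K."""
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]          # decreasing order, as in the paper
    gaps = np.abs(lam[i] - np.delete(lam, i))
    return np.sum(1.0 / gaps)

def lemma_1_rhs(K, X, i, eps, M, f_lip):
    """Right hand side of Lemma 1: exp(-eps^2 / (18 M^4 |f|_L^2 R_i^2 lambda_{1,p}^2(Sigma)))."""
    Sigma = np.cov(X, rowvar=False)
    ev = np.linalg.eigvalsh(Sigma)
    cov_gap = ev[-1] - ev[0]                             # lambda_1(Sigma) - lambda_p(Sigma)
    R_i = eigengap_sum(K, i)
    return np.exp(-eps**2 / (18.0 * M**4 * f_lip**2 * R_i**2 * cov_gap**2))

# Illustrative usage on a Gaussian kernel matrix.
X = np.random.randn(100, 5)
sq = np.sum(X**2, axis=1)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2 * X @ X.T)) / X.shape[0]
print(lemma_1_rhs(K, X, i=0, eps=0.1, M=3.0, f_lip=0.5))
```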

4 CASE STUDY: KERNEL TARGET-ALIGNMENT

In this section, we provide an example of using the ingredients of the proofs in Section 3 to derive a concentration inequality for the sample kernel alignment. This is a good example of how the combination of a proper matrix perturbation result and a concentration inequality provides an informative, simple, and easy to compute concentration bound.

Kernel target-alignment was proposed in (Cristianini et al., 2002) for measuring the agreement between a kernel matrix and a given learning task, i.e. classification. By changing the eigenvalues of the kernel matrix, one can improve the alignment to the labels, i.e. the target values, which results in an improvement of the learning performance. This approach has been proposed for kernel learning (for more detail see (Cristianini et al., 2002)). (Jia & Liao, 2009) proposed to use the bound in Theorem 2.2 to derive a concentration inequality for the sample kernel alignment. Suppose a sample set {(x_i, y_i)}_{i=1}^n, x_i ∈ R^p and y_i ∈ {−1, +1}, is given. Let K_n ∈ R^{n×n} be a Mercer kernel matrix defined by [K_n]_{i,j} = k(x_i, x_j), for all 1 ≤ i, j ≤ n. The kernel target-alignment, or kernel alignment, is defined by

    A(y) = ⟨y ⊗ y, k⟩ / (||y ⊗ y|| ||k||) = ⟨y ⊗ y, k⟩ / ||k||,

where ⟨f, g⟩ = ∫_{X²} f(x, z) g(x, z) dP(x) dP(z). Similarly, the sample kernel target-alignment of K_n is defined by

    A(K_n) = ⟨K_n, Y ⊗ Y⟩_F / √(⟨Y ⊗ Y, Y ⊗ Y⟩_F ⟨K_n, K_n⟩_F) = Y^T K_n Y / (n ||K_n||_F),

where Y = (y_1, ..., y_n), ⟨·, ·⟩_F is the Frobenius inner product and ||·||_F is the Frobenius norm. In this section, we drop the subscript F. The following bound is suggested in (Jia & Liao, 2009):

    P{ |A(K_n) − A(y)| > ε } ≤ 2 exp(−2ε²(n − 1)² / (n C²(θ))),

where C(θ) = |A(K_n)| θ^{−1} (m − (m − 1)θ + (2n − 1)/||K_n||) and θ is defined by θ = 1 − max_{s=1,...,n} min_{i=1,...,n−1} λ_i(K_n^s)/λ_i(K_n), where K_n^s is the submatrix of K_n in which the s-th row and column are replaced by zero vectors. Here, we provide another concentration bound, which depends only on the eigenvalues of the sample kernel matrix and can be approximated by the ratio between the first and second largest eigenvalues of the kernel matrix.

Theorem 4.1. Let us take Assumption 1, and let A(y) and A(K_n) be defined as above. Then, we have

    P{ |A(K_n) − A(y)| > ε } ≤ 2 exp(−2ε² / (A(K_n) |1/(n − 1) − ||K_n||/L| + 2(1/(n − 1) + 1/L))²).

Proof (Sketch). The proof follows from the Cauchy interlacing lemma and the bounded difference inequality. Let us define K_n^n as K_n^s above, with s = n. Then, we have

    |A(K_n) − A(K_n^n)| ≤ | ⟨K_n, Y Y^T⟩ / (n||K_n||) − ⟨K_n^n, Y^n Y^{n,T}⟩ / ((n − 1)||K_n^n||) |
                        ≤ | ⟨K_n, Y Y^T⟩ / (n||K_n||) − ⟨K_n, Y Y^T⟩ / ((n − 1)||K_n^n||) |
                          + | ⟨K_n, Y Y^T⟩ / ((n − 1)||K_n^n||) − ⟨K_n^n, Y^n Y^{n,T}⟩ / ((n − 1)||K_n^n||) |
                        ≤ ⟨K_n, Y Y^T⟩_F | 1/((n − 1)||K_n^n||) − 1/(n||K_n||) | + (2n − 1)/((n − 1)||K_n^n||).

The result follows by plugging the following inequality into the above:

    L := √(Σ_{i=2}^{n−1} λ_i²(K_n)) ≤ ||K_n^n||_F ≤ √(Σ_{i=1}^{n−1} λ_i²(K_n)).

Therefore, we obtain

    |A(K_n) − A(K_n^n)| ≤ A(K_n) |1/(n − 1) − ||K_n||/L| + 2 (1/(n − 1) + 1/L),

and applying the bounded difference inequality with this per-sample bound yields the claimed result.
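To illustrate the quantities in Theorem 4.1, here is a small sketch (Python with NumPy assumed, labels in {−1, +1}; the helper names and the toy linear kernel are assumptions, not from the paper) that computes the sample kernel target-alignment A(K_n) = Y^T K_n Y / (n ||K_n||_F) and the ratio λ_1(K_n)/λ_2(K_n), which is used below as an approximation of ||K_n||/L.

```python
import numpy as np

def sample_alignment(K, y):
    """Sample kernel target-alignment A(K_n) = y^T K y / (n * ||K||_F)."""
    n = K.shape[0]
    return float(y @ K @ y) / (n * np.linalg.norm(K, 'fro'))

def top_eigenvalue_ratio(K):
    """lambda_1(K_n) / lambda_2(K_n), which approximates ||K_n|| / L for large n."""
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]
    return lam[0] / lam[1]

# Illustrative usage on a toy two-class sample with a linear kernel (an assumption).
rng = np.random.default_rng(0)
y = np.sign(rng.standard_normal(100))
X = rng.standard_normal((100, 5)) + 0.5 * y[:, None]
K = X @ X.T
print(sample_alignment(K, y), top_eigenvalue_ratio(K))
```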

In the preceding results, the term ||K_n||/L can be approximated by λ_1(K_n)/λ_2(K_n). Therefore, for a sufficiently large sample size, the gap between the first and second eigenvalues controls the concentration of the sample kernel alignment.

5 CONCLUDING REMARKS

Kernel methods are very successful approaches for solving different machine learning problems. This success is mainly rooted in using feature maps and kernel functions, which transform the given samples into a possibly higher dimensional space.

Kernel matrices/operators can be characterized by the properties of their spectral decomposition. Also, in the computation of the excess risk, we eventually need the eigenvalue information. Concentration inequalities for the spectrum of kernel matrices measure how close the sample eigenvalues/eigenvectors are to the population values. The previous concentration results are either suboptimal or impractical to compute. This paper improves upon the optimality and computational feasibility of the previous results by applying Cauchy's interlacing lemma. Moreover, for inner product and Euclidean distance kernels, e.g. radial basis functions (Steinwart & Christmann, 2008), we derive new types of bounds, which connect the concentration of sample kernel eigenvalues and eigenvectors to the eigenvalues of the sample covariance matrix. This result may explain the poor performance of some nonlinear kernels in very high dimensions, as the largest eigenvalue of the sample covariance matrix becomes very large. As an interesting case study, we establish a computationally less demanding, simple, and informative bound for the concentration of the sample kernel target-alignment.

The results can be improved further, for example by using more careful calculations of the spectral norms in the bounded difference computation, other operator perturbation results, or local concentration inequalities.

Acknowledgements

The authors acknowledge the reviewers' comments, which improved the paper considerably. NR and RV were funded by the Academy of Finland, through the Centres of Excellence Program 2006-2011.

References

[1] Bengio, Y., Vincent, P., Paiement, J., Delalleau, O., Ouimet, M., Le Roux, N., Spectral clustering and kernel PCA are learning eigenfunctions. Technical Report TR 1239, University of Montreal, 2003.

[2] Jia, L., Liao, S., Accurate probabilistic error bound for eigenvalues of kernel matrix. LNCS, Vol. 5828, 162-175, 2009.

[3] Mendelson, S., Pajor, A., Ellipsoid approximation using random vectors. COLT 2005, Springer-Verlag, LNAI 3359, 429-443, 2005.

[4] Koltchinskii, V., Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability 43, 191-227, 1998.

[5] Koltchinskii, V., Giné, E., Random matrix approximation of spectra of integral operators. Bernoulli 6(1), 113-167, 2000.

[6] Shawe-Taylor, J., Williams, C.K., Cristianini, N., Kandola, J., On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory 51(7), 2510-2522, 2005.

[7] Zwald, L., Blanchard, G., On the convergence of eigenspaces in kernel principal component analysis. NIPS, 1649-1656, MIT Press, Cambridge, 2006.

[8] Cristianini, N., Kandola, J., Elisseeff, A., Shawe-Taylor, J., On kernel-target alignment. Journal of Machine Learning Research 1, 131, 2002.

[9] Steinwart, I., Christmann, A., Support Vector Machines. Springer, 2008.

[10] Horn, R., Johnson, C., Matrix Analysis. Cambridge University Press, New York, 1985.

[11] Ng, A.Y., Zheng, A.X., Jordan, M.I., Link analysis, eigenvectors and stability. Proceedings of the 17th International Joint Conference on Artificial Intelligence, 903-910, Seattle, USA, 2001.

[12] Kato, T., Perturbation Theory for Linear Operators. Springer-Verlag, New York, 1996.

[13] Ledoux, M., Talagrand, M., Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, Berlin, 1991.