Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Infinite Kernel Learning: Generalization Bounds and Algorithms

Yong Liu,1 Shizhong Liao,2 Hailun Lin,1 Yinliang Yue,1∗ Weiping Wang1
1Institute of Information Engineering, CAS   2School of Computer Science and Technology, Tianjin University
{liuyong,linhailun,yueyinliang,wangweiping}@iie.ac.cn, {szliao,yongliu}@tju.edu.cn

Abstract

Kernel learning is a fundamental problem both in recent research and application of kernel methods. Existing kernel learning methods commonly use some measure of generalization error to learn the optimal kernel in a convex (or conic) combination of prescribed basic kernels. However, the generalization bounds derived by these measures usually have slow convergence rates, and the basic kernels are finite and should be specified in advance. In this paper, we propose a new kernel learning method based on a novel measure of generalization error, called principal eigenvalue proportion (PEP), which can learn the optimal kernel with sharp generalization bounds over the convex hull of a possibly infinite set of basic kernels. We first derive sharp generalization bounds based on the PEP measure. Then we design two kernel learning algorithms for finite kernels and infinite kernels respectively, in which the derived sharp generalization bounds are exploited to guarantee faster convergence rates; moreover, the basic kernels can be learned automatically for infinite kernel learning instead of being prescribed in advance. Theoretical analysis and empirical results demonstrate that the proposed kernel learning method outperforms the state-of-the-art kernel learning methods.

Introduction

Kernel methods have been successfully applied to various problems in the machine learning community. The performance of these methods strongly depends on the choice of kernel functions (Micchelli and Pontil 2005). The earliest method for learning a kernel function is cross-validation, which is computationally expensive and only applicable to kernels with a small number of parameters. Minimizing theoretical estimate bounds of the generalization error is an alternative to cross-validation (Liu, Jiang, and Liao 2014). To this end, several measures have been introduced, such as VC dimension (Vapnik 2000), covering number (Zhang 2002), Rademacher complexity (Bartlett and Mendelson 2002), radius-margin (Vapnik 2000), and maximal discrepancy (Anguita et al. 2012). Unfortunately, the generalization bounds derived from these measures usually have slow convergence rates, of order at most $O(1/\sqrt{n})$, where $n$ is the size of the data set.

Instead of learning a single kernel, multiple kernel learning (MKL) follows a different route and learns a set of combination coefficients of basic kernels (Lanckriet et al. 2004; Bach, Lanckriet, and Jordan 2004; Ong, Smola, and Williamson 2005; Sonnenburg et al. 2006; Rakotomamonjy et al. 2008; Kloft et al. 2009; 2011; Cortes, Mohri, and Rostamizadeh 2010; Cortes, Kloft, and Mohri 2013; Liu, Liao, and Hou 2011). Within this framework, the final kernel is usually a convex (or conic) combination of finitely many basic kernels that should be specified in advance by users.

To improve the accuracy of MKL, some researchers studied the problem of learning a kernel in the convex hull of a prescribed set of continuously parameterized basic kernels (Micchelli and Pontil 2005; Argyriou, Micchelli, and Pontil 2005; Argyriou et al. 2006; Gehler and Nowozin 2008; Ghiasi-Shirazi, Safabakhsh, and Shamsi 2010). This flexibility of the kernel class can translate into significant improvements in accuracy. However, the measures used by the existing finite and infinite MKL algorithms are usually the radius-margin (SVM objective) or other related regularization functionals, which usually have slow convergence rates.

In this paper, we introduce a novel measure of generalization error based on spectral analysis, called principal eigenvalue proportion (PEP), and create a new kernel learning method over a possibly infinite set of basic kernels. We first derive generalization bounds with convergence rates of order $O(\log(n)/n)$ based on the PEP measure. By minimizing the derived PEP-based sharp generalization bounds, we design two new kernel learning algorithms for finite kernels and infinite kernels respectively: one can be formulated as a convex optimization problem and the other as a semi-infinite program. For the infinite case, the basic kernels are learned automatically instead of being specified in advance. Experimental results on many benchmark data sets show that our proposed method can significantly outperform the existing kernel learning methods. Theoretical analysis and experimental results demonstrate that our PEP-based kernel learning method is sound and effective.

∗Corresponding author.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Work

In this subsection, we introduce the related measures of generalization error and infinite kernel learning methods.

Measures of Generalization Error

In recent years, local Rademacher complexities were used to derive sharper generalization bounds. Koltchinskii and Panchenko (2000) first applied the notion of the local Rademacher complexity to obtain data-dependent upper bounds using an iterative method. Lugosi and Wegkamp (2004) established oracle inequalities using this notion and demonstrated its advantages over those based on the complexity of the whole model class. Bartlett, Bousquet, and Mendelson (2005) derived generalization bounds based on a local and empirical version of Rademacher complexity, and further presented applications to classification and prediction with convex function classes. Koltchinskii (2006) proposed new bounds in terms of the local Rademacher complexity, and applied these bounds to develop model selection techniques in abstract risk minimization problems. Srebro, Sridharan, and Tewari (2010) established an excess risk bound for ERM with local Rademacher complexity. Mendelson (2003) presented sharp bounds on the local Rademacher complexity of the reproducing kernel Hilbert space in terms of the eigenvalues of the integral operator associated with the kernel function. Based on the connection between local Rademacher complexity and the tail eigenvalues of the integral operator, Kloft and Blanchard (2012) derived generalization bounds for MKL. Unfortunately, the eigenvalues of the integral operator of a kernel function are difficult to compute, so Cortes, Kloft, and Mohri (2013) used the tail eigenvalues of the kernel matrix, that is, the empirical version of the tail eigenvalues of the integral operator, to design new kernel learning algorithms. However, a generalization bound based on the tail eigenvalues of the kernel matrix was not established. Moreover, for different kinds of kernel functions (or the same kind with different parameters), the discrepancies among the eigenvalues of different kernels may be very large, hence the absolute value of the tail eigenvalues of a kernel function cannot precisely reflect the goodness of different kernels. Liu and Liao (2015) first considered the relative value of eigenvalues for kernel methods. In this paper, we consider another measure based on the relative value of eigenvalues, namely the proportion of the sum of the first $t$ largest principal eigenvalues to the sum of all eigenvalues, for kernel learning. Moreover, we derive sharper bounds of order $O(\log(n)/n)$ using this relative value of the eigenvalues of the kernel matrix.

Infinite Kernel Learning

Lots of work focuses on finite kernel learning, but few studies consider the infinite case. The seminal work of Micchelli and Pontil (2005) generalized kernel learning to convex combinations of an infinite number of kernels indexed by a compact set. Argyriou et al. (2006) proposed an efficient DC programming algorithm for learning such kernels. Gehler and Nowozin (2008) reformulated the optimization problem of (Argyriou et al. 2006), and proposed an infinite kernel learning framework for solving it numerically. Ghiasi-Shirazi, Safabakhsh, and Shamsi (2010) considered the problem of optimizing a kernel function over the class of translation-invariant kernels. Although using infinite kernels can improve the accuracy, the objectives of the existing infinite kernel learning methods are all the SVM objective or other related regularization functionals, which have slow convergence rates and are usually only applicable to classification. In this paper, we propose an infinite kernel learning method, which uses the PEP-based measure with a fast convergence rate and is applicable to both classification and regression.

Notations and Preliminaries

We consider supervised learning with a sample $S = \{(x_i, y_i)\}_{i=1}^{n}$ of size $n$ drawn i.i.d. from a fixed but unknown probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ denotes the output domain, $\mathcal{Y} = \{-1, +1\}$ for classification and $\mathcal{Y} \subseteq \mathbb{R}$ for regression. Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function, and $\mathbf{K} = \frac{1}{n}\left[K(x_i, x_j)\right]_{i,j=1}^{n}$ be its corresponding kernel matrix. For most MKL algorithms, the kernel is learned over a convex combination of finite basic kernels:
$$\mathcal{K}^{\text{finite}} = \left\{ \sum_{i=1}^{m} d_i K_{\theta_i} : d_i \ge 0,\ \sum_{i=1}^{m} d_i = 1 \right\}, \qquad (1)$$
where $K_{\theta_i}$, $i = 1, \ldots, m$, are the basic kernels. Micchelli and Pontil (2005) show that keeping the number $m$ of basic kernels fixed is an unnecessary restriction, and one can instead search over a possibly infinite set of basic kernels to improve the accuracy. Therefore, in this paper, we consider the general kernel class
$$\mathcal{K}^{\text{infinite}} = \left\{ \int_{\Omega} K_\theta \, dp(\theta) : p \in \mathcal{M}(\Omega) \right\}, \qquad (2)$$
where $K_\theta$ is a kernel function associated with the parameter $\theta \in \Omega$, $\Omega$ is a compact set and $\mathcal{M}(\Omega)$ is the set of all probability measures on $\Omega$. Note that $\Omega$ can be a continuously parameterized set of basic kernels. For example, $\Omega \subset \mathbb{R}_+$ and $K_\theta(x, x') = \exp(-\theta\|x - x'\|^2)$, or $\Omega = [1, c]$ and $K_\theta(x, x') = (1 + x^{\top}x')^{\theta}$. If $\Omega = \mathbb{N}_m$, $\mathcal{K}^{\text{infinite}}$ corresponds to $\mathcal{K}^{\text{finite}}$.

The generalization error (or risk)
$$R(f) := \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y)\, dP(x, y)$$
associated with a hypothesis $f$ is defined through a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, M]$, where $M$ is a constant. In this paper, for classification, $\ell$ is the hinge loss $\ell(t, y) = \max(0, 1 - yt)$; for regression, $\ell$ is the $\epsilon$-loss $\ell(t, y) = \max(0, |y - t| - \epsilon)$. Since the probability distribution $P$ is unknown, $R(f)$ cannot be explicitly computed, thus we have to resort to its empirical estimator
$$\hat{R}(f) := \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i).$$
In the following, we assume that $\kappa = \sup_{x \in \mathcal{X}} K_\theta(x, x) < \infty$ for any $\theta \in \Omega$, and that $K_\theta$ is differentiable w.r.t. $\theta$.
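As a small illustration of the finite kernel class in Equation (1), the following sketch (ours, not from the paper) builds the combined kernel matrix $\sum_{i=1}^{m} d_i \mathbf{K}_{\theta_i}$ from Gaussian basic kernels with NumPy; the Gaussian parameterization and the helper names are assumptions made only for illustration.

```python
import numpy as np

def gaussian_kernel_matrix(X, theta):
    """Kernel matrix with entries K_theta(x_i, x_j) = exp(-theta * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-theta * dist2)

def combined_kernel_matrix(X, thetas, d):
    """Convex combination sum_i d_i K_{theta_i}, as in Equation (1).

    Note: the paper additionally normalizes the kernel matrix by 1/n.
    """
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and np.isclose(d.sum(), 1.0), "weights must lie on the simplex"
    return sum(di * gaussian_kernel_matrix(X, ti) for di, ti in zip(d, thetas))

# toy usage with three basic kernels and uniform weights
X = np.random.randn(8, 3)
K_d = combined_kernel_matrix(X, thetas=[0.1, 1.0, 10.0], d=[1/3, 1/3, 1/3])
```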

Generalization Bounds

In this section, we introduce a novel measure of generalization error, called principal eigenvalue proportion (PEP), and use it to derive generalization bounds with fast convergence rates.

Principal Eigenvalue Proportion (PEP)

Definition 1 (Principal Eigenvalue Proportion). Let $K$ be a Mercer kernel and $K(x, x') = \sum_{j=1}^{\infty}\lambda_j(K)\phi_j(x)\phi_j(x')$ be its eigenvalue decomposition, where $(\lambda_j(K))_{j=1}^{\infty}$ is the sequence of eigenvalues of the kernel function arranged in descending order. Then, the $t$-principal eigenvalue proportion ($t$-PEP) of $K$, $t \in \mathbb{N}_+$, is defined as
$$\beta(K, t) = \frac{\sum_{i=1}^{t}\lambda_i(K)}{\sum_{i=1}^{\infty}\lambda_i(K)}.$$

One can see that the $t$-PEP is the proportion of the sum of the first $t$ principal eigenvalues to that of all eigenvalues of a kernel function. Note that $\beta(K, t) \in [0, 1]$, and it is invariant with respect to scaling of $K$, that is, $\beta(K, t) = \beta(cK, t)$ for any $c \neq 0$. In the next subsection, we will show that the larger the value of $\beta(K, t)$, the sharper the generalization bound. Thus, $\beta(K, t)$ can be regarded as a complexity measure of $K$: the larger its value, the less complex $K$ is.

Definition 2 (Empirical PEP). Let $K$ be a kernel function and $\mathbf{K} = \frac{1}{n}\left[K(x_i, x_j)\right]_{i,j=1}^{n}$ be its kernel matrix. Then, the $t$-empirical PEP of $K$, $t \in \{1, \ldots, n-1\}$, is defined as
$$\hat{\beta}(K, t) = \frac{\sum_{i=1}^{t}\lambda_i(\mathbf{K})}{\mathrm{tr}(\mathbf{K})},$$
where $\lambda_i(\mathbf{K})$ is the $i$-th eigenvalue of $\mathbf{K}$ in descending order, and $\mathrm{tr}(\mathbf{K})$ is the trace of $\mathbf{K}$.

The empirical PEP can be considered as an empirical version of the PEP, which can also be used to yield generalization bounds. The empirical PEP can be computed from empirical data, so we can use it in practical applications. Note that the $t$-empirical PEP can be computed in $O(tn^2)$ time for each kernel, because it is sufficient to compute the $t$ largest eigenvalues and the trace of $\mathbf{K}$.

Generalization Bounds with PEP

Theorem 1. Let $\ell(\cdot, \cdot)$ be an $L$-Lipschitz loss function associated with the first variable. For any $K \in \mathcal{K}^{\text{infinite}}$, denote
$$\mathcal{H}_K^{\text{infinite}} = \{\langle w, \Phi_K(x)\rangle : \|w\| \le \Delta\},$$
where $\mathcal{K}^{\text{infinite}}$ is the infinite kernel class defined in Equation (2). Then, $\forall\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample $S = \{(x_i, y_i)\}_{i=1}^{n}$ drawn i.i.d. according to $P$, the following inequality holds for all $k > 1$ and all $f \in \mathcal{H}_K^{\text{infinite}}$:
$$R(f) \le \frac{k}{k-1}\hat{R}(f) + c_1\frac{k\int_{\Omega}(1 - \beta(K_\theta, t))\,dp(\theta)}{n} + c_2\frac{k}{n},$$
where $c_1 = 20\Delta^2 L^2\kappa$ and $c_2 = (4 + 11M)\log\frac{1}{\delta} + 20kL^2 t$.

According to the above theorem, one can see that, for any $t$, a larger value of $\int_{\Omega}\beta(K_\theta, t)\,dp(\theta)$ leads to a sharper generalization bound. Moreover, if we set $k = \log(n)$, the convergence rate of $R(f) - \frac{k}{k-1}\hat{R}(f)$ is
$$O\!\left(\frac{k\int_{\Omega}(1 - \beta(K_\theta, t))\,dp(\theta)}{n} + \frac{1}{n}\right) = O\!\left(\frac{\log(n)}{n}\right).$$
When $n$ is not very small, we know that $\frac{k}{k-1} = \frac{\log(n)}{\log(n)-1} \approx 1$, so $R(f) - \frac{k}{k-1}\hat{R}(f) \approx R(f) - \hat{R}(f)$. Note that the hinge loss and the $\epsilon$-loss are $L$-Lipschitz loss functions with $L = 1$, which shows that the $L$-Lipschitz assumption on the loss function is reasonable.

Traditional Generalization Error Bounds

Generalization bounds based on Rademacher complexity are standard (Bartlett and Mendelson 2002). For any $\delta > 0$ and $f \in \mathcal{H}$, with probability $1 - \delta$,
$$R(f) - \hat{R}(f) \le \mathcal{R}_n(\mathcal{H})/2 + \sqrt{\ln(1/\delta)/2n},$$
where $\mathcal{R}_n(\mathcal{H})$ is the Rademacher average of $\mathcal{H}$. $\mathcal{R}_n(\mathcal{H})$ is of order $O(1/\sqrt{n})$ for various kernel classes used in practice. Thus, this bound converges at a rate of at most $O(1/\sqrt{n})$.

For the radius-margin bound (Vapnik 2000), which is usually adopted for MKL, with probability at least $1 - \delta$ the following inequality holds:
$$R(f) - \hat{R}(f) \le \sqrt{c\left(\frac{R^2}{\rho^2}\log^2 n - \log\delta\right)\Big/n}.$$
This bound converges at a rate of at most $O\big(\sqrt{\log^2 n / n}\big)$.

Bousquet and Elisseeff (2002) derived stability-based generalization bounds, and proved that if the algorithm has uniform stability $\eta$, then with probability $1 - \delta$,
$$R(f) - \hat{R}(f) \le 2\eta + (4n\eta + M)\sqrt{\log\frac{1}{\delta}\Big/2n}.$$
The order of this convergence rate is at most $O(1/\sqrt{n})$.

The above theoretical analysis indicates that using the PEP measure yields sharper bounds with fast convergence rates, which demonstrates the effectiveness of using the PEP to estimate the generalization error.

If $\Omega$ is a finite set, from Theorem 1 it is easy to prove the following:

Corollary 1. Assume that $\ell$ is an $L$-Lipschitz loss function. For any $K \in \mathcal{K}^{\text{finite}}$, denote
$$\mathcal{H}_K^{\text{finite}} = \{\langle w, \Phi_K(x)\rangle : \|w\| \le \Delta\},$$
where $\mathcal{K}^{\text{finite}}$ is the finite kernel class defined in (1). Then, with probability $1 - \delta$, the following inequality holds for all $k \ge 1$ and $f \in \mathcal{H}_K^{\text{finite}}$:
$$R(f) \le \frac{k}{k-1}\hat{R}(f) + c_1\frac{k\sum_{i=1}^{m}d_i(1 - \beta(K_{\theta_i}, t))}{n} + c_2\frac{k}{n},$$
where $c_1 = 20\Delta^2 L^2\kappa$ and $c_2 = (4 + 11M)\log\frac{1}{\delta} + 20kL^2 t$.

This result is remarkable since the generalization bound admits an explicit dependency on the combination coefficients $d_i$: it is a weighted average of $1 - \beta(K_{\theta_i}, t)$ with combination weights $d_i$. Note that $1 - \beta(K_{\theta_i}, t)$ can be considered as a complexity of $K_{\theta_i}$; the larger the value of $1 - \beta(K_{\theta_i}, t)$, the more complex $K_{\theta_i}$ is. This bound suggests that, while some $K_{\theta_i}$ could have large complexities, they may not be detrimental to generalization if the corresponding total combination weight is relatively small. Thus, to guarantee good generalization performance, it is reasonable to learn a kernel function by minimizing $\hat{R}(f)$, $\|w\|^2$ and $\sum_i d_i(1 - \beta(K_{\theta_i}, t))$. Moreover, we can see that the convergence rate can reach $O(\log(n)/n)$ when setting $k = \log(n)$.
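To make Definition 2 concrete, here is a minimal sketch (ours; it assumes NumPy/SciPy and a symmetric positive semidefinite kernel matrix) that computes the $t$-empirical PEP as the sum of the $t$ largest eigenvalues divided by the trace. Using a partial eigendecomposition keeps the cost close to the $O(tn^2)$ noted above.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def empirical_pep(K, t):
    """t-empirical PEP: (sum of the t largest eigenvalues of K) / trace(K)."""
    # which='LA' asks for the t algebraically largest eigenvalues (requires t < n)
    top_eigs = eigsh(K, k=t, which='LA', return_eigenvectors=False)
    return float(np.sum(top_eigs) / np.trace(K))

# toy usage: PEP of a Gaussian kernel matrix (1/n-normalized, as in the paper)
X = np.random.randn(50, 4)
sq = np.sum(X ** 2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T)) / X.shape[0]
print(empirical_pep(K, t=4))
```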

Generalization Bounds with Empirical PEP

Since it is not easy to compute the value of the PEP, we have to use the empirical PEP for practical kernel learning. Fortunately, we can also use it to derive sharp bounds.

Theorem 2. Assume $\ell$ is an $L$-Lipschitz loss function. Then, with probability $1 - \delta$, for all $k \ge 1$ and $f \in \mathcal{H}_K^{\text{infinite}}$,
$$R(f) \le \frac{k}{k-1}\hat{R}(f) + c_3\frac{k\int_{\Omega}(1 - \hat{\beta}(K_\theta, t))\,dp(\theta)}{n} + c_4\frac{k}{n},$$
where $c_3 = 40\Delta^2 L^2\kappa$ and $c_4 = (14 + 2M)\log\frac{1}{\delta} + 40kL^2 t$.

One can see that the convergence rate can also reach $O(\log(n)/n)$ when setting $k = \log(n)$, which indicates the effectiveness of the empirical PEP for estimating the generalization error.

Corollary 2. Assume $\ell$ is an $L$-Lipschitz loss function. Then, with probability $1 - \delta$, for all $k \ge 1$ and $f \in \mathcal{H}_K^{\text{finite}}$, we have
$$R(f) \le \frac{k}{k-1}\hat{R}(f) + c_3\frac{k\sum_{i=1}^{m}d_i(1 - \hat{\beta}(K_{\theta_i}, t))}{n} + c_4\frac{k}{n},$$
where $c_3 = 40\Delta^2 L^2\kappa$ and $c_4 = (14 + 2M)\log\frac{1}{\delta} + 40kL^2 t$.

Kernel Learning Algorithms

In this section, we exploit the empirical PEP to devise kernel learning algorithms for finite and infinite kernels.

Finite Kernel Learning

This subsection studies the finite case. In this case, the final kernel $K_{\mathbf{d}}$ can be written as
$$K_{\mathbf{d}} = \sum_{i=1}^{m}d_i K_{\theta_i}, \quad d_i \ge 0, \quad \sum_{i=1}^{m}d_i = 1,$$
where $K_{\theta_i}$, $i = 1, \ldots, m$, are the basic kernels. According to the theoretical analysis of the above section, to guarantee generalization performance, we can formulate the following optimization:
$$\min_{\mathbf{d}}\ P(\mathbf{d}) = \min_{w, b}\ \frac{1}{2}w^{\top}w + C\sum_{i=1}^{n}\ell(f(x_i), y_i) - \lambda\sum_{i=1}^{m}d_i\hat{\beta}_i(t), \qquad (3)$$
$$\text{s.t.}\quad \mathbf{d} \ge 0,\quad \|\mathbf{d}\|_1 = 1,$$
where $\hat{\beta}_i(t)$ is the $t$-empirical PEP of $K_{\theta_i}$, $C$ and $\lambda$ are two trade-off parameters that control the balance between the empirical loss and the empirical PEP, and $f(x) = \langle w, \Phi_{\mathbf{d}}(x)\rangle + b$ with $K_{\mathbf{d}}(x, x') = \langle\Phi_{\mathbf{d}}(x), \Phi_{\mathbf{d}}(x')\rangle$.

To solve the optimization (3), we need to calculate the gradient $\nabla P(\mathbf{d})$. This can be achieved by moving to the dual formulation of $P(\mathbf{d})$, given by (for classification and regression respectively)
$$D_C(\mathbf{d}) = \max_{\alpha}\ \mathbf{1}^{\top}\alpha - \frac{1}{2}\alpha^{\top}Y\mathbf{K}_{\mathbf{d}}Y\alpha - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t), \quad \text{s.t.}\ \mathbf{1}^{\top}Y\alpha = 0,\ 0 \le \alpha \le C,$$
and
$$D_R(\mathbf{d}) = \max_{\alpha}\ \mathbf{1}^{\top}Y\alpha - \frac{1}{2}\alpha^{\top}\mathbf{K}_{\mathbf{d}}\alpha - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t) - \epsilon\mathbf{1}^{\top}|\alpha|, \quad \text{s.t.}\ \mathbf{1}^{\top}\alpha = 0,\ 0 \le |\alpha| \le C,$$
where $\mathbf{K}_{\mathbf{d}} = \sum_{i=1}^{m}d_i\mathbf{K}_{\theta_i}$ is the kernel matrix for a given $\mathbf{d}$, $Y = \mathrm{diag}(y_1, \ldots, y_n)$, and $\hat{\boldsymbol{\beta}}(t) = (\hat{\beta}_1(t), \ldots, \hat{\beta}_m(t))^{\top}$.

Note that we can write $P = E - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t)$ and $D = W - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t)$ with strong duality holding between $E$ and $W$. Thus, given any value of $\mathbf{d}$, from Lemma 2 in (Chapelle et al. 2002), we have
$$\frac{\partial P}{\partial d_k} = \frac{\partial D}{\partial d_k} = -\lambda\hat{\beta}_k(t) - \frac{1}{2}\alpha^{*\top}H\alpha^{*},$$
where $H = Y\mathbf{K}_{\theta_k}Y$ for classification, $H = \mathbf{K}_{\theta_k}$ for regression, and $\alpha^{*}$ is the optimal solution of the dual optimization $D$. Thus, all we need to take a gradient step is to obtain $\alpha^{*}$. If $\mathbf{d}$ is fixed, $P$ or $D$ is equivalent to the corresponding single-kernel problem with kernel matrix $\mathbf{K}_{\mathbf{d}}$. The PEP-based finite kernel learning algorithm (FKL) is summarized in Algorithm 1. The step size $\eta$ is chosen based on the Armijo rule to guarantee convergence, and the projection step for the constraints $\mathbf{d} \ge 0$ and $\|\mathbf{d}\|_1 = 1$ is as simple as $\mathbf{d} \leftarrow \max(0, \mathbf{d})$, $\mathbf{d} \leftarrow \mathbf{d}/\|\mathbf{d}\|_1$. From a convergence analysis similar to that of (Rakotomamonjy et al. 2008), it is easy to verify that our FKL algorithm is convergent.

Algorithm 1 Finite Kernel Learning (FKL)
1: Initialize: basic kernels $\{K_{\theta_i}\}_{i=1}^{m}$, $d_i^0 = \frac{1}{m}$, $\hat{\beta}_i(t) = \sum_{j=1}^{t}\lambda_j(\mathbf{K}_{\theta_i})/\mathrm{tr}(\mathbf{K}_{\theta_i})$, $i = 1, \ldots, m$, $s = 0$.
2: repeat
3:   $\mathbf{K}_{\mathbf{d}^s} = \sum_{i=1}^{m}d_i^s\mathbf{K}_{\theta_i}$.
4:   Use an SVM solver to solve the single-kernel problem with $\mathbf{K}_{\mathbf{d}^s}$ and obtain the optimal solution $\alpha^s$.
5:   $d_k^{s+1} \leftarrow d_k^s + \eta\big(\lambda\hat{\beta}_k(t) + \frac{1}{2}\alpha^{s\top}H\alpha^s\big)$, $k = 1, \ldots, m$.
6:   $\mathbf{d}^{s+1} \leftarrow \max(0, \mathbf{d}^{s+1})$, $\mathbf{d}^{s+1} = \mathbf{d}^{s+1}/\|\mathbf{d}^{s+1}\|_1$.
7:   $s \leftarrow s + 1$.
8: until converged

The time complexity of the FKL algorithm is $O(tmn^2 + kmn^2 + kA)$, where $m$ is the number of basic kernels, $n$ is the size of the data set, $A$ is the time complexity of training a single-kernel problem, and $k$ is the number of iterations of FKL. In our experiments, we find that our FKL algorithm converges very quickly.
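The following sketch (ours, not the authors' code) illustrates one iteration of Algorithm 1 for the classification case: the reduced gradient $\partial D/\partial d_k = -\lambda\hat{\beta}_k(t) - \frac{1}{2}\alpha^{\top}H\alpha$ is followed by the projection $\mathbf{d} \leftarrow \max(0, \mathbf{d})$, $\mathbf{d} \leftarrow \mathbf{d}/\|\mathbf{d}\|_1$. The SVM dual vector $\alpha$ is assumed to come from an external solver, and a fixed step size stands in for the Armijo rule.

```python
import numpy as np

def fkl_step(d, basic_kernels, beta_hat, y, alpha, lam, eta):
    """One projected-gradient update of the weights d (Algorithm 1, classification case).

    basic_kernels: list of the m kernel matrices K_{theta_i}
    beta_hat:      array of the m t-empirical PEP values
    alpha:         dual solution of the single-kernel SVM with kernel sum_i d_i K_{theta_i}
    """
    grad = np.empty_like(d)
    for k, K_k in enumerate(basic_kernels):
        H = (y[:, None] * K_k) * y[None, :]      # H = Y K_{theta_k} Y
        grad[k] = lam * beta_hat[k] + 0.5 * alpha @ H @ alpha
    d = np.maximum(d + eta * grad, 0.0)          # step in the direction -dP/dd, then d <- max(0, d)
    s = np.sum(d)
    return d / s if s > 0 else np.full_like(d, 1.0 / len(d))   # d <- d / ||d||_1
```

In a complete implementation, $\alpha$ would be recomputed by the SVM solver on $\mathbf{K}_{\mathbf{d}}$ at every iteration, as in lines 3 and 4 of Algorithm 1.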
Infinite Kernel Learning

In this subsection, we consider the infinite case. The objective we are interested in is the best possible finite kernel learning objective:
$$\inf_{\Omega^{\text{finite}} \subset \Omega}\ \min_{\mathbf{d}, w, b}\ \frac{1}{2}w^{\top}w + C\sum_{i=1}^{n}\ell(f(x_i), y_i) - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t), \qquad (4)$$
$$\text{s.t.}\quad \sum_{\theta \in \Omega^{\text{finite}}}d_\theta = 1, \quad d_\theta \ge 0,\ \forall\theta \in \Omega^{\text{finite}},$$

where $\Omega^{\text{finite}}$ is a finite set and $\Omega$ is a continuously parameterized set. The inner minimization problem is a finite kernel learning problem; from (Sonnenburg, Rätsch, and Schäfer 2006; Sonnenburg et al. 2006), the dual problem of optimization (4) can be written as
$$\sup_{\Omega^{\text{finite}} \subset \Omega}\ \max_{\alpha, \xi}\ h(\alpha) - \xi,$$
$$\text{s.t.}\quad \xi \in \mathbb{R},\ 0 \le q(\alpha_i) \le C, \qquad (5)$$
$$Q(\theta; \alpha) \le \xi,\ \forall\theta \in \Omega^{\text{finite}}.$$
For classification, $h(\alpha) = \mathbf{1}^{\top}\alpha$, $q(\alpha_i) = \alpha_i$, $Q(\theta; \alpha) = \lambda\hat{\beta}_\theta(t) + \frac{1}{2}\alpha^{\top}Y\mathbf{K}_\theta Y\alpha$; for regression, $h(\alpha) = \mathbf{1}^{\top}Y\alpha - \epsilon\mathbf{1}^{\top}|\alpha|$, $q(\alpha_i) = |\alpha_i|$, $Q(\theta; \alpha) = \lambda\hat{\beta}_\theta(t) + \frac{1}{2}\alpha^{\top}\mathbf{K}_\theta\alpha$, where $\hat{\beta}_\theta(t) = \sum_{j=1}^{t}\lambda_j(\mathbf{K}_\theta)/\mathrm{tr}(\mathbf{K}_\theta)$. Note that if some point $(\alpha^{*}, \xi^{*})$ satisfies the last condition of optimization (5) for all $\theta \in \Omega$, then it also satisfies the condition for all finite subsets thereof. Thus we omit the supremum of optimization (5) and extend the program to the following semi-infinite program:
$$\max_{\alpha, \xi}\ h(\alpha) - \xi,$$
$$\text{s.t.}\quad \xi \in \mathbb{R},\ 0 \le q(\alpha_i) \le C, \qquad (6)$$
$$Q(\theta; \alpha) \le \xi,\ \forall\theta \in \Omega.$$

The dual form (6) suggests a delayed constraint generation approach. We start with a finite constraint set $\Omega^0 \subset \Omega$ and search for the optimal $\alpha^{*}$. Then we find a new $\theta$, $\theta \leftarrow \arg\max_{\theta \in \Omega}Q(\theta, \alpha^{*})$, and add it into the set $\Omega^0$. The violated constraints are subsequently included in $\Omega^s \subset \Omega^{s+1} \subset \Omega$. The PEP-based infinite kernel learning algorithm (IFKL) is summarized in Algorithm 2. One can see that the basic kernels are selected automatically during the learning phase.

Algorithm 2 Infinite Kernel Learning (IFKL)
1: Initialize: continuously parameterized set $\Omega$, randomly select $\theta^0 \in \Omega$, $\Omega^0 = \emptyset$, $s = 0$, $\alpha^0 = 0$, $\xi = -\infty$.
2: while $Q(\theta^s; \alpha^s) > \xi$ do
3:   $\Omega^{s+1} = \Omega^s \cup \{\theta^s\}$.
4:   Use FKL (Algorithm 1) with $\Omega^{s+1}$ as the basic kernels to obtain the optimal solution $\alpha^{s+1}$.
5:   $\xi = \max_{\theta \in \Omega^{s+1}}Q(\theta, \alpha^{s+1})$.
6:   $\theta^{s+1} \leftarrow \arg\max_{\theta \in \Omega}Q(\theta, \alpha^{s+1})$ {Sub-Problem, solved in Algorithm 3}.
7:   $s \leftarrow s + 1$.
8: end while

From Theorem 7.2 in (Hettich and Kortanek 1993), we know that if the sub-problem (line 6 in Algorithm 2) can be solved, Algorithm 2 stops after a finite number of iterations or has at least one point of accumulation, and each of these points solves optimization (6).

To run Algorithm 2, we should solve the sub-problem (line 6 of Algorithm 2):
$$\arg\max_{\theta \in \Omega}\ Q(\theta; \alpha) = \lambda\hat{\beta}_\theta(t) + \frac{1}{2}\alpha^{\top}H\alpha, \qquad (7)$$
where $H = Y\mathbf{K}_\theta Y$ for classification and $H = \mathbf{K}_\theta$ for regression. We want to use a gradient-based algorithm to search for the maxima of $Q(\theta, \alpha)$ over $\Omega$. Thus, we should compute $\frac{\partial\hat{\beta}_\theta(t)}{\partial\theta}$. However, $\hat{\beta}_\theta(t)$ is not differentiable.

In the following, we show how to obtain an approximate solution of $\hat{\beta}_\theta(t)$. For any $\theta \in \Omega$, let $\mu_j^s$ be the $j$-th eigenvector of $\mathbf{K}_{\theta^s}$, $j = 1, \ldots, t$; then
$$\hat{\beta}_\theta(t) = \frac{\sum_{j=1}^{t}\lambda_j(\mathbf{K}_\theta)}{\mathrm{tr}(\mathbf{K}_\theta)} \ge \frac{\sum_{j=1}^{t}\mu_j^{s\top}\mathbf{K}_\theta\mu_j^s}{\mathrm{tr}(\mathbf{K}_\theta)} =: \tilde{\beta}_\theta(t, \theta^s).$$
Thus, we can use $\arg\max_{\theta \in \Omega}\tilde{Q}(\theta; \alpha) = \lambda\tilde{\beta}_\theta(t, \theta^s) + \frac{1}{2}\alpha^{\top}H\alpha$ to approximate the optimization (7). Note that
$$\frac{\partial\tilde{Q}(\theta, \alpha)}{\partial\theta} = \lambda\frac{\partial\tilde{\beta}_\theta(t, \theta^s)}{\partial\theta} + \frac{1}{2}\alpha^{\top}\frac{\partial H}{\partial\theta}\alpha, \qquad (8)$$
where $\frac{\partial\tilde{\beta}_\theta(t, \theta^s)}{\partial\theta} = \frac{\sum_{j=1}^{t}\mu_j^{s\top}\frac{\partial\mathbf{K}_\theta}{\partial\theta}\mu_j^s}{\mathrm{tr}(\mathbf{K}_\theta)} - \frac{\tilde{\beta}_\theta(t, \theta^s)}{\mathrm{tr}(\mathbf{K}_\theta)}\frac{\partial\,\mathrm{tr}(\mathbf{K}_\theta)}{\partial\theta}$. Thus, we can apply a gradient-based algorithm to solve the sub-problem, which is summarized in Algorithm 3.

Algorithm 3 Sub-Problem
1: Initialize: randomly select any $\theta^0 \in \Omega$, $s = 0$.
2: repeat
3:   Compute the eigenvectors $\mu_1^s, \ldots, \mu_t^s$ of $\mathbf{K}_{\theta^s}$.
4:   $\theta^{s+1} \leftarrow \theta^s + \eta\big(\lambda\frac{\partial\tilde{\beta}_\theta(t, \theta^s)}{\partial\theta} + \frac{1}{2}\alpha^{\top}\frac{\partial H}{\partial\theta}\alpha\big)$ (see Equation (8)).
5:   $s \leftarrow s + 1$.
6: until converged

The time complexity of IFKL is $O(k(tn^2 + sn^2 + A))$, where $k$ is the number of iterations of IFKL (Algorithm 2), $s$ is the number of iterations of Algorithm 3, and $A$ is the time complexity of FKL (Algorithm 1). In our experiments, we find that IFKL converges very quickly, and $k$ is smaller than 10 on all data sets.
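As a rough illustration of Algorithm 3, the sketch below (ours, under stated assumptions: a one-dimensional Gaussian parameterization $K_\theta(x, x') = \exp(-\theta\|x - x'\|^2)$ so that $\partial\mathbf{K}_\theta/\partial\theta$ is the elementwise product of $-\|x_i - x_j\|^2$ with $\mathbf{K}_\theta$, the classification case $H = Y\mathbf{K}_\theta Y$, and a fixed step size and iteration count instead of a convergence test) performs gradient ascent on the smoothed sub-problem objective using Equation (8).

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def subproblem_ascent(theta0, X, y, alpha, lam, t, eta=0.1, iters=50):
    """Gradient ascent on the surrogate objective Q~(theta; alpha) of the sub-problem (7)-(8)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)   # squared distances
    theta = theta0
    for _ in range(iters):
        K = np.exp(-theta * D2)
        dK = -D2 * K                                 # dK/dtheta for the Gaussian parameterization
        _, U = eigsh(K, k=t, which='LA')             # top-t eigenvectors of K at the current theta
        tr, dtr = np.trace(K), np.trace(dK)
        beta_tilde = np.sum(U * (K @ U)) / tr        # surrogate beta~(t, theta^s)
        dbeta = np.sum(U * (dK @ U)) / tr - beta_tilde * dtr / tr
        dH = (y[:, None] * dK) * y[None, :]          # dH/dtheta = Y (dK/dtheta) Y
        theta += eta * (lam * dbeta + 0.5 * alpha @ dH @ alpha)   # Equation (8)
    return theta
```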

Experiments

In this section, we empirically compare our PEP-based finite kernel learning (FKL) and infinite kernel learning (IFKL) with 9 popular finite and infinite kernel learning methods: centered-alignment based MKL with linear combination (CABMKL (linear)) and conic combination (CABMKL (conic)) (Cortes, Mohri, and Rostamizadeh 2010), SimpleMKL (Rakotomamonjy et al. 2008), the generalized MKL algorithm (GMKL) (Varma and Babu 2009), the nonlinear MKL algorithm with L1-norm (NLMKL (p = 1)) and L2-norm (NLMKL (p = 2)) (Cortes, Mohri, and Rostamizadeh 2009), group Lasso-based MKL algorithms with L1-norm (GLMKL (p = 1)) and L2-norm (GLMKL (p = 2)) (Kloft et al. 2011), and the state-of-the-art infinite kernel learning (IKL) (Gehler and Nowozin 2008).

The data sets are 10 publicly available data sets from LIBSVM Data, listed in Table 1. For finite kernels, we use the Gaussian kernel $K(x, x') = \exp(-\tau\|x - x'\|^2/2)$ and the polynomial kernel $K(x, x') = (1 + x^{\top}x')^d$ as our basic kernels, with $\tau \in \{2^i, i = -10, -9, \ldots, 10\}$ and $d \in \{1, 2, \ldots, 20\}$. For infinite kernels, we use the Gaussian kernel and the polynomial kernel with continuously parameterized sets, $\tau \in [2^{-10}, 2^{10}]$ and $d \in [1, 20]$. The regularization parameter $C$ of all algorithms is set to 1. The other parameters of the compared algorithms follow the experimental settings in their papers. The parameters $\lambda \in \{2^i, i = -5, \ldots, 5\}$ and $t \in \{2^i, i = 1, \ldots, 4\}$ of our algorithms are determined by 3-fold cross-validation on the training set. For each data set, we run all methods 30 times with a randomly selected 50% of all data for training and the other 50% for testing. We use the t-test to describe the statistical discrepancy of different methods. All statements of statistical significance in the remainder refer to a 95% level of significance.

Table 1: The average test accuracies of our IFKL and FKL, and the other methods: CABMKL (linear), CABMKL (conic), SimpleMKL, GMKL, GLMKL (p = 1), GLMKL (p = 2), NLMKL (p = 1), NLMKL (p = 2) and IKL. We bold the numbers of the best method, and underline the numbers of the other methods which are not significantly worse than the best one.

Datasets         IFKL (ours)  FKL (ours)   CABMKL(linear)  CABMKL(conic)  SimpleMKL    GMKL
australian       85.87±0.88   85.85±0.84   84.41±0.86      85.69±0.92     85.39±0.92   85.40±0.93
a2a              81.14±0.07   80.87±0.08   80.12±0.09      74.42±0.07     80.53±0.07   80.56±0.07
diabetes         76.25±0.99   76.19±0.94   75.19±1.40      75.55±0.89     75.71±0.85   75.71±0.85
german.numer     74.88±1.08   74.32±1.31   71.94±1.69      73.77±1.38     73.99±1.47   73.99±1.48
heart            82.27±2.92   82.15±5.50   82.02±3.85      82.05±5.47     81.53±3.31   82.12±3.81
ionosphere       93.67±1.07   93.66±1.02   91.57±1.49      91.59±1.43     91.53±1.26   91.52±1.24
liver-disorders  71.25±3.83   70.52±5.95   68.50±6.08      69.04±6.32     66.92±6.64   70.52±5.95
sonar            84.90±6.08   82.53±5.35   81.99±5.94      81.99±5.94     82.15±5.09   82.15±5.09
splice           85.48±0.76   85.46±0.82   84.20±0.90      84.20±0.90     84.31±0.85   84.55±0.81
svmguide3        82.96±0.71   82.83±0.72   80.76±0.78      80.34±0.77     82.81±0.71   82.81±0.71

Datasets         NLMKL(p=1)   NLMKL(p=2)   GLMKL(p=1)      GLMKL(p=2)     IKL
australian       85.35±0.90   84.70±0.94   85.46±0.88      85.39±0.92     85.78±0.85
a2a              79.16±0.07   79.07±0.08   80.71±0.07      80.82±0.08     80.84±0.08
diabetes         74.57±0.80   73.94±1.00   75.81±0.84      75.77±0.78     76.02±1.00
german.numer     73.55±1.04   72.45±1.22   74.07±1.53      73.77±1.38     74.83±1.02
heart            81.51±3.00   80.17±5.45   82.10±3.63      82.07±3.69     82.12±3.81
ionosphere       91.53±1.66   91.65±1.54   91.50±1.37      91.08±0.98     93.63±1.05
liver-disorders  68.61±5.47   68.55±4.30   64.57±3.43      66.97±6.24     70.58±5.29
sonar            81.25±6.44   81.89±6.27   82.24±5.13      81.86±6.43     83.04±6.50
splice           83.45±1.14   83.69±0.99   85.40±0.79      53.29±1.05     85.46±0.82
svmguide3        82.19±0.71   82.53±0.68   82.75±0.72      80.23±1.29     82.83±0.72

The average test accuracies are reported in Table 1, which can be summarized as follows: 1) Our IFKL gives the best results on all data sets and is significantly better than the compared finite kernel selection methods (CABMKL, SimpleMKL, GMKL and NLMKL) on nearly all data sets. This indicates that learning the kernel with PEP over a prescribed set of continuously parameterized basic kernels can guarantee good generalization performance. 2) IFKL is significantly better than the state-of-the-art infinite kernel learning (IKL) on 4 out of 10 data sets, and FKL is significantly better than the compared finite kernel learning methods (CABMKL, SimpleMKL, GMKL and NLMKL) on most data sets. 3) FKL gives results comparable to IFKL. 4) The infinite kernel learning methods (IFKL and IKL) are usually better than the compared finite kernel learning methods (CABMKL, SimpleMKL, GMKL and NLMKL), which manifests the effectiveness of using infinite kernels. The above results show that the use of PEP can significantly improve the performance of kernel learning algorithms for both finite and infinite kernels.

Conclusion

In this paper, we have created a new kernel learning method based on the principal eigenvalue proportion (PEP) measure of generalization error, which can learn the optimal kernel with a fast convergence rate over the convex hull of a possibly infinite set of basic kernels. Using the PEP measure, we have derived sharper generalization bounds with fast convergence rates, which improve the generalization bounds of most existing kernel learning algorithms. Exploiting the derived generalization bounds, we have designed two effective kernel learning algorithms with statistical guarantees and fast convergence rates, which outperform the state-of-the-art kernel learning methods.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China under grants No. 61602467 and No. 61303056, and by the Youth Innovation Promotion Association, CAS, No. 2016146.

References

Anguita, D.; Ghio, A.; Oneto, L.; and Ridella, S. 2012. In-sample and out-of-sample model selection and error estimation for support vector machines. IEEE Transactions on Neural Networks and Learning Systems 23(9):1390–1406.
Argyriou, A.; Hauser, R.; Micchelli, C. A.; and Pontil, M. 2006. A DC-programming algorithm for kernel selection. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), 41–48. ACM.
Argyriou, A.; Micchelli, C. A.; and Pontil, M. 2005. Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the 18th Annual Conference on Learning Theory (COLT 2005), 338–352. Springer.
Bach, F.; Lanckriet, G.; and Jordan, M. 2004. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), 41–48.
Bartlett, P. L., and Mendelson, S. 2002. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3:463–482.
Bartlett, P. L.; Bousquet, O.; and Mendelson, S. 2005. Local Rademacher complexities. The Annals of Statistics 33(4):1497–1537.
Bousquet, O., and Elisseeff, A. 2002. Stability and generalization. Journal of Machine Learning Research 2:499–526.
Chapelle, O.; Vapnik, V.; Bousquet, O.; and Mukherjee, S. 2002. Choosing multiple parameters for support vector machines. Machine Learning 46(1-3):131–159.
Cortes, C.; Kloft, M.; and Mohri, M. 2013. Learning kernels using local Rademacher complexity. In Advances in Neural Information Processing Systems 25 (NIPS 2013), 2760–2768. MIT Press.
Cortes, C.; Mohri, M.; and Rostamizadeh, A. 2009. Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems 22 (NIPS 2009), 396–404.
Cortes, C.; Mohri, M.; and Rostamizadeh, A. 2010. Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 239–246.
Gehler, P., and Nowozin, S. 2008. Infinite kernel learning. In NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels.
Ghiasi-Shirazi, K.; Safabakhsh, R.; and Shamsi, M. 2010. Learning translation invariant kernels for classification. Journal of Machine Learning Research 11:1353–1390.
Hettich, R., and Kortanek, K. O. 1993. Semi-infinite programming: theory, methods, and applications. SIAM Review 35(3):380–429.
Kloft, M., and Blanchard, G. 2012. On the convergence rate of ℓp-norm multiple kernel learning. Journal of Machine Learning Research 13(1):2465–2502.
Kloft, M.; Brefeld, U.; Sonnenburg, S.; Laskov, P.; Müller, K.-R.; and Zien, A. 2009. Efficient and accurate ℓp-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22 (NIPS 2009), 997–1005.
Kloft, M.; Brefeld, U.; Sonnenburg, S.; and Zien, A. 2011. ℓp-norm multiple kernel learning. Journal of Machine Learning Research 12:953–997.
Koltchinskii, V., and Panchenko, D. 2000. Rademacher processes and bounding the risk of function learning. Springer.
Koltchinskii, V. 2006. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34(6):2593–2656.
Lanckriet, G. R. G.; Cristianini, N.; Bartlett, P. L.; Ghaoui, L. E.; and Jordan, M. I. 2004. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5:27–72.
Liu, Y., and Liao, S. 2015. Eigenvalues ratio for kernel selection of kernel methods. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI 2015).
Liu, Y.; Jiang, S.; and Liao, S. 2014. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 324–332.
Liu, Y.; Liao, S.; and Hou, Y. 2011. Learning kernels with upper bounds of leave-one-out error. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), 2205–2208.
Lugosi, G., and Wegkamp, M. 2004. Complexity regularization via localized random penalties. The Annals of Statistics 32:1679–1697.
Mendelson, S. 2003. On the performance of kernel classes. Journal of Machine Learning Research 4:759–771.
Micchelli, C. A., and Pontil, M. 2005. Learning the kernel function via regularization. Journal of Machine Learning Research 6:1099–1125.
Ong, C. S.; Smola, A. J.; and Williamson, R. C. 2005. Learning the kernel with hyperkernels. Journal of Machine Learning Research 6:1043–1071.
Rakotomamonjy, A.; Bach, F.; Canu, S.; and Grandvalet, Y. 2008. SimpleMKL. Journal of Machine Learning Research 9:2491–2521.
Sonnenburg, S.; Rätsch, G.; Schäfer, C.; and Schölkopf, B. 2006. Large scale multiple kernel learning. Journal of Machine Learning Research 7:1531–1565.
Sonnenburg, S.; Rätsch, G.; and Schäfer, C. 2006. A general and efficient multiple kernel learning algorithm. In Advances in Neural Information Processing Systems 18 (NIPS 2006).
Srebro, N.; Sridharan, K.; and Tewari, A. 2010. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 22, 2199–2207. MIT Press.
Vapnik, V. 2000. The Nature of Statistical Learning Theory. Springer Verlag.
Varma, M., and Babu, B. R. 2009. More generality in efficient multiple kernel learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), 134–141.
Zhang, T. 2002. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research 2:527–550.