Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Infinite Kernel Learning: Generalization Bounds and Algorithms

Yong Liu,1 Shizhong Liao,2 Hailun Lin,1 Yinliang Yue,1∗ Weiping Wang1
1Institute of Information Engineering, CAS   2School of Computer Science and Technology, Tianjin University
{liuyong,linhailun,yueyinliang,wangweiping}@iie.ac.cn, {szliao,yongliu}@tju.edu.cn

Abstract

Kernel learning is a fundamental problem both in recent research and application of kernel methods. Existing kernel learning methods commonly use some measure of generalization error to learn the optimal kernel in a convex (or conic) combination of prescribed basic kernels. However, the generalization bounds derived by these measures usually have slow convergence rates, and the basic kernels are finite and should be specified in advance. In this paper, we propose a new kernel learning method based on a novel measure of generalization error, called principal eigenvalue proportion (PEP), which can learn the optimal kernel with sharp generalization bounds over the convex hull of a possibly infinite set of basic kernels. We first derive sharp generalization bounds based on the PEP measure. Then we design two kernel learning algorithms for finite kernels and infinite kernels respectively, in which the derived sharp generalization bounds are exploited to guarantee faster convergence rates; moreover, the basic kernels can be learned automatically for infinite kernel learning instead of being prescribed in advance. Theoretical analysis and empirical results demonstrate that the proposed kernel learning method outperforms the state-of-the-art kernel learning methods.

Introduction

Kernel methods have been successfully applied to various problems in the machine learning community. The performance of these methods strongly depends on the choice of kernel functions (Micchelli and Pontil 2005). The earliest method for learning a kernel function is cross-validation, which is computationally expensive and only applicable to kernels with a small number of parameters. Minimizing theoretical estimate bounds of the generalization error is an alternative to cross-validation (Liu, Jiang, and Liao 2014). To this end, several measures have been introduced, such as VC dimension (Vapnik 2000), covering number (Zhang 2002), Rademacher complexity (Bartlett and Mendelson 2002), radius-margin (Vapnik 2000), and maximal discrepancy (Anguita et al. 2012). Unfortunately, the generalization bounds derived from these measures usually have slow convergence rates, of order at most $O(1/\sqrt{n})$, where $n$ is the size of the data set.

Instead of learning a single kernel, multiple kernel learning (MKL) follows a different route and learns a set of combination coefficients of basic kernels (Lanckriet et al. 2004; Bach, Lanckriet, and Jordan 2004; Ong, Smola, and Williamson 2005; Sonnenburg et al. 2006; Rakotomamonjy et al. 2008; Kloft et al. 2009; 2011; Cortes, Mohri, and Rostamizadeh 2010; Cortes, Kloft, and Mohri 2013; Liu, Liao, and Hou 2011). Within this framework, the final kernel is usually a convex (or conic) combination of finitely many basic kernels that should be specified in advance by users.

To improve the accuracy of MKL, some researchers studied the problem of learning a kernel in the convex hull of a prescribed set of continuously parameterized basic kernels (Micchelli and Pontil 2005; Argyriou, Micchelli, and Pontil 2005; Argyriou et al. 2006; Gehler and Nowozin 2008; Ghiasi-Shirazi, Safabakhsh, and Shamsi 2010). This flexibility of the kernel class can translate into significant improvements in accuracy. However, the measures used by the existing finite and infinite MKL algorithms are usually the radius-margin (SVM objective) or other related regularization functionals, which usually have slow convergence rates.

In this paper, we introduce a novel measure of generalization error based on spectral analysis, called principal eigenvalue proportion (PEP), and create a new kernel learning method over a possibly infinite set of basic kernels. We first derive generalization bounds with convergence rates of order $O(\log(n)/n)$ based on the PEP measure. By minimizing the derived PEP-based sharp generalization bounds, we design two new kernel learning algorithms for finite kernels and infinite kernels respectively: one can be formulated as a convex optimization problem and the other as a semi-infinite program. For the infinite case, the basic kernels are learned automatically instead of being specified in advance. Experimental results on many benchmark data sets show that our proposed method can significantly outperform the existing kernel learning methods. Theoretical analysis and experimental results demonstrate that our PEP-based kernel learning method is sound and effective.

∗Corresponding author.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Work

In this subsection, we introduce the related measures of generalization error and infinite kernel learning methods.

Measures of Generalization Error

In recent years, local Rademacher complexities were used to derive sharper generalization bounds. Koltchinskii and Panchenko (2000) first applied the notion of the local Rademacher complexity to obtain data-dependent upper bounds using an iterative method. Lugosi and Wegkamp (2004) established oracle inequalities using this notion and demonstrated its advantages over those based on the complexity of the whole model class. Bartlett, Bousquet, and Mendelson (2005) derived generalization bounds based on a local and empirical version of Rademacher complexity, and further presented applications to classification and prediction with convex function classes. Koltchinskii (2006) proposed new bounds in terms of the local Rademacher complexity, and applied these bounds to develop model selection techniques in abstract risk minimization problems. Srebro, Sridharan, and Tewari (2010) established an excess risk bound for ERM with local Rademacher complexity. Mendelson (2003) presented sharp bounds on the local Rademacher complexity of the reproducing kernel Hilbert space in terms of the eigenvalues of the integral operator associated with the kernel function. Based on the connection between local Rademacher complexity and the tail eigenvalues of the integral operator, Kloft and Blanchard (2012) derived generalization bounds for MKL. Unfortunately, the eigenvalues of the integral operator of a kernel function are difficult to compute, so Cortes, Kloft, and Mohri (2013) used the tail eigenvalues of the kernel matrix, that is, the empirical version of the tail eigenvalues of the integral operator, to design new kernel learning algorithms. However, a generalization bound based on the tail eigenvalues of the kernel matrix was not established. Moreover, for different kinds of kernel functions (or the same kind with different parameters), the discrepancies among the eigenvalues of different kernels may be very large, hence the absolute value of the tail eigenvalues of a kernel function cannot precisely reflect the goodness of different kernels. Liu and Liao (2015) first considered the relative value of eigenvalues for kernel methods. In this paper, we consider another measure based on the relative value of eigenvalues, namely the proportion of the sum of the first $t$ largest principal eigenvalues to the sum of all eigenvalues, for kernel learning. Moreover, we derive sharper bounds of order $O(\log(n)/n)$ using this relative value of the eigenvalues of the kernel matrix.

Infinite Kernel Learning

Lots of work focuses on finite kernel learning, but few studies consider the infinite case. The seminal work of Micchelli and Pontil (2005) generalized kernel learning to convex combinations of an infinite number of kernels indexed by a compact set. Argyriou et al. (2006) proposed an efficient DC programming algorithm for learning such kernels. Gehler and Nowozin (2008) reformulated the optimization problem of (Argyriou et al. 2006), and proposed an infinite kernel learning framework for solving it numerically. Ghiasi-Shirazi, Safabakhsh, and Shamsi (2010) considered the problem of optimizing a kernel function over the class of translation-invariant kernels. Although using infinite kernels can improve the accuracy, the objectives of the existing infinite kernel learning methods are all the SVM objective or other related regularization functionals, which have slow convergence rates and are usually only applicable to classification. In this paper, we propose an infinite kernel learning method, which uses the PEP-based measure with a fast convergence rate and is applicable to both classification and regression.

Notations and Preliminaries

We consider supervised learning with a sample $S = \{(x_i, y_i)\}_{i=1}^{n}$ of size $n$ drawn i.i.d. from a fixed but unknown probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ denotes the output domain, $\mathcal{Y} = \{-1, +1\}$ for classification and $\mathcal{Y} \subseteq \mathbb{R}$ for regression. Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function, and $\mathbf{K} = \frac{1}{n}\left[K(x_i, x_j)\right]_{i,j=1}^{n}$ be its corresponding kernel matrix. For most MKL algorithms, the kernel is learned over a convex combination of finite basic kernels:
$$\mathcal{K}^{\text{finite}} = \left\{ \sum_{i=1}^{m} d_i K_{\theta_i} : d_i \ge 0,\ \sum_{i=1}^{m} d_i = 1 \right\}, \qquad (1)$$
where $K_{\theta_i}$, $i = 1, \ldots, m$, are the basic kernels. Micchelli and Pontil (2005) show that keeping the number $m$ of basic kernels fixed is an unnecessary restriction, and one can instead search over a possibly infinite set of basic kernels to improve the accuracy. Therefore, in this paper, we consider the general kernel class
$$\mathcal{K}^{\text{infinite}} = \left\{ \int_{\Omega} K_\theta \, dp(\theta) : p \in \mathcal{M}(\Omega) \right\}, \qquad (2)$$
where $K_\theta$ is a kernel function associated with the parameter $\theta \in \Omega$, $\Omega$ is a compact set and $\mathcal{M}(\Omega)$ is the set of all probability measures on $\Omega$. Note that $\Omega$ can be a continuously parameterized set of basic kernels. For example, $\Omega \subset \mathbb{R}_+$ and $K_\theta(x, x') = \exp(-\theta\|x - x'\|^2)$, or $\Omega = [1, c]$ and $K_\theta(x, x') = (1 + x^{\top}x')^{\theta}$. If $\Omega = \mathbb{N}_m$, $\mathcal{K}^{\text{infinite}}$ corresponds to $\mathcal{K}^{\text{finite}}$.

The generalization error (or risk)
$$R(f) := \int_{\mathcal{X} \times \mathcal{Y}} \ell(f(x), y)\, dP(x, y)$$
associated with a hypothesis $f$ is defined through a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, M]$, where $M$ is a constant. In this paper, for classification, $\ell$ is the hinge loss $\ell(t, y) = \max(0, 1 - yt)$; for regression, $\ell$ is the $\epsilon$-loss $\ell(t, y) = \max(0, |y - t| - \epsilon)$. Since the probability distribution $P$ is unknown, $R(f)$ cannot be explicitly computed, thus we have to resort to its empirical estimator
$$\hat{R}(f) := \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i).$$
In the following, we assume that $\kappa = \sup_{x \in \mathcal{X}} K_\theta(x, x) < \infty$ for any $\theta \in \Omega$, and that $K_\theta$ is differentiable w.r.t. $\theta$.
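As a small illustration of the finite kernel class in Equation (1), the following sketch (ours, not from the paper) builds the combined kernel matrix $\sum_{i=1}^{m} d_i \mathbf{K}_{\theta_i}$ from Gaussian basic kernels with NumPy; the Gaussian parameterization and the helper names are assumptions made only for illustration.

```python
import numpy as np

def gaussian_kernel_matrix(X, theta):
    """Kernel matrix with entries K_theta(x_i, x_j) = exp(-theta * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-theta * dist2)

def combined_kernel_matrix(X, thetas, d):
    """Convex combination sum_i d_i K_{theta_i}, as in Equation (1).

    Note: the paper additionally normalizes the kernel matrix by 1/n.
    """
    d = np.asarray(d, dtype=float)
    assert np.all(d >= 0) and np.isclose(d.sum(), 1.0), "weights must lie on the simplex"
    return sum(di * gaussian_kernel_matrix(X, ti) for di, ti in zip(d, thetas))

# toy usage with three basic kernels and uniform weights
X = np.random.randn(8, 3)
K_d = combined_kernel_matrix(X, thetas=[0.1, 1.0, 10.0], d=[1/3, 1/3, 1/3])
```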

Generalization Bounds

In this section, we introduce a novel measure of generalization error, called principal eigenvalue proportion (PEP), and use it to derive generalization bounds with fast convergence rates.

Principal Eigenvalue Proportion (PEP)

Definition 1 (Principal Eigenvalue Proportion). Let $K$ be a Mercer kernel and $K(x, x') = \sum_{j=1}^{\infty}\lambda_j(K)\phi_j(x)\phi_j(x')$ be its eigenvalue decomposition, where $(\lambda_j(K))_{j=1}^{\infty}$ is the sequence of eigenvalues of the kernel function arranged in descending order. Then, the $t$-principal eigenvalue proportion ($t$-PEP) of $K$, $t \in \mathbb{N}_+$, is defined as
$$\beta(K, t) = \frac{\sum_{i=1}^{t}\lambda_i(K)}{\sum_{i=1}^{\infty}\lambda_i(K)}.$$

One can see that the $t$-PEP is the proportion of the sum of the first $t$ principal eigenvalues to that of all eigenvalues of a kernel function. Note that $\beta(K, t) \in [0, 1]$, and it is invariant with respect to scaling of $K$, that is, $\beta(K, t) = \beta(cK, t)$ for any $c \neq 0$. In the next subsection, we will show that the larger the value of $\beta(K, t)$, the sharper the generalization bound. Thus, $\beta(K, t)$ can be regarded as a complexity measure of $K$: the larger its value, the less complex $K$ is.

Definition 2 (Empirical PEP). Let $K$ be a kernel function and $\mathbf{K} = \frac{1}{n}\left[K(x_i, x_j)\right]_{i,j=1}^{n}$ be its kernel matrix. Then, the $t$-empirical PEP of $K$, $t \in \{1, \ldots, n-1\}$, is defined as
$$\hat{\beta}(K, t) = \frac{\sum_{i=1}^{t}\lambda_i(\mathbf{K})}{\mathrm{tr}(\mathbf{K})},$$
where $\lambda_i(\mathbf{K})$ is the $i$-th eigenvalue of $\mathbf{K}$ in descending order, and $\mathrm{tr}(\mathbf{K})$ is the trace of $\mathbf{K}$.

The empirical PEP can be considered as an empirical version of the PEP, which can also be used to yield generalization bounds. The empirical PEP can be computed from empirical data, so we can use it in practical applications. Note that the $t$-empirical PEP can be computed in $O(tn^2)$ time for each kernel, because it is sufficient to compute the $t$ largest eigenvalues and the trace of $\mathbf{K}$.

Generalization Bounds with PEP

Theorem 1. Let $\ell(\cdot, \cdot)$ be an $L$-Lipschitz loss function associated with the first variable. For any $K \in \mathcal{K}^{\text{infinite}}$, denote
$$\mathcal{H}_K^{\text{infinite}} = \{\langle w, \Phi_K(x)\rangle : \|w\| \le \Delta\},$$
where $\mathcal{K}^{\text{infinite}}$ is the infinite kernel class defined in Equation (2). Then, $\forall\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample $S = \{(x_i, y_i)\}_{i=1}^{n}$ drawn i.i.d. according to $P$, the following inequality holds for all $k > 1$ and all $f \in \mathcal{H}_K^{\text{infinite}}$:
$$R(f) \le \frac{k}{k-1}\hat{R}(f) + c_1\frac{k\int_{\Omega}(1 - \beta(K_\theta, t))\,dp(\theta)}{n} + c_2\frac{k}{n},$$
where $c_1 = 20\Delta^2 L^2\kappa$ and $c_2 = (4 + 11M)\log\frac{1}{\delta} + 20kL^2 t$.

According to the above theorem, one can see that, for any $t$, a larger value of $\int_{\Omega}\beta(K_\theta, t)\,dp(\theta)$ leads to a sharper generalization bound. Moreover, if we set $k = \log(n)$, the convergence rate of $R(f) - \frac{k}{k-1}\hat{R}(f)$ is
$$O\!\left(\frac{k\int_{\Omega}(1 - \beta(K_\theta, t))\,dp(\theta)}{n} + \frac{1}{n}\right) = O\!\left(\frac{\log(n)}{n}\right).$$
When $n$ is not very small, we know that $\frac{k}{k-1} = \frac{\log(n)}{\log(n)-1} \approx 1$, so $R(f) - \frac{k}{k-1}\hat{R}(f) \approx R(f) - \hat{R}(f)$. Note that the hinge loss and the $\epsilon$-loss are $L$-Lipschitz loss functions with $L = 1$, which shows that the $L$-Lipschitz assumption on the loss function is reasonable.

Traditional Generalization Error Bounds

Generalization bounds based on Rademacher complexity are standard (Bartlett and Mendelson 2002). For any $\delta > 0$ and $f \in \mathcal{H}$, with probability $1 - \delta$,
$$R(f) - \hat{R}(f) \le \mathcal{R}_n(\mathcal{H})/2 + \sqrt{\ln(1/\delta)/2n},$$
where $\mathcal{R}_n(\mathcal{H})$ is the Rademacher average of $\mathcal{H}$. $\mathcal{R}_n(\mathcal{H})$ is of order $O(1/\sqrt{n})$ for various kernel classes used in practice. Thus, this bound converges at a rate of at most $O(1/\sqrt{n})$.

For the radius-margin bound (Vapnik 2000), which is usually adopted for MKL, with probability at least $1 - \delta$ the following inequality holds:
$$R(f) - \hat{R}(f) \le \sqrt{c\left(\frac{R^2}{\rho^2}\log^2 n - \log\delta\right)\Big/n}.$$
This bound converges at a rate of at most $O\big(\sqrt{\log^2 n / n}\big)$.

Bousquet and Elisseeff (2002) derived stability-based generalization bounds, and proved that if the algorithm has uniform stability $\eta$, then with probability $1 - \delta$,
$$R(f) - \hat{R}(f) \le 2\eta + (4n\eta + M)\sqrt{\log\frac{1}{\delta}\Big/2n}.$$
The order of this convergence rate is at most $O(1/\sqrt{n})$.

The above theoretical analysis indicates that using the PEP measure yields sharper bounds with fast convergence rates, which demonstrates the effectiveness of using the PEP to estimate the generalization error.

If $\Omega$ is a finite set, from Theorem 1 it is easy to prove the following:

Corollary 1. Assume that $\ell$ is an $L$-Lipschitz loss function. For any $K \in \mathcal{K}^{\text{finite}}$, denote
$$\mathcal{H}_K^{\text{finite}} = \{\langle w, \Phi_K(x)\rangle : \|w\| \le \Delta\},$$
where $\mathcal{K}^{\text{finite}}$ is the finite kernel class defined in (1). Then, with probability $1 - \delta$, the following inequality holds for all $k \ge 1$ and $f \in \mathcal{H}_K^{\text{finite}}$:
$$R(f) \le \frac{k}{k-1}\hat{R}(f) + c_1\frac{k\sum_{i=1}^{m}d_i(1 - \beta(K_{\theta_i}, t))}{n} + c_2\frac{k}{n},$$
where $c_1 = 20\Delta^2 L^2\kappa$ and $c_2 = (4 + 11M)\log\frac{1}{\delta} + 20kL^2 t$.

This result is remarkable since the generalization bound admits an explicit dependency on the combination coefficients $d_i$: it is a weighted average of $1 - \beta(K_{\theta_i}, t)$ with combination weights $d_i$. Note that $1 - \beta(K_{\theta_i}, t)$ can be considered as a complexity of $K_{\theta_i}$; the larger the value of $1 - \beta(K_{\theta_i}, t)$, the more complex $K_{\theta_i}$ is. This bound suggests that, while some $K_{\theta_i}$ could have large complexities, they may not be detrimental to generalization if the corresponding total combination weight is relatively small. Thus, to guarantee good generalization performance, it is reasonable to learn a kernel function by minimizing $\hat{R}(f)$, $\|w\|^2$ and $\sum_i d_i(1 - \beta(K_{\theta_i}, t))$. Moreover, we can see that the convergence rate can reach $O(\log(n)/n)$ when setting $k = \log(n)$.
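To make Definition 2 concrete, here is a minimal sketch (ours; it assumes NumPy/SciPy and a symmetric positive semidefinite kernel matrix) that computes the $t$-empirical PEP as the sum of the $t$ largest eigenvalues divided by the trace. Using a partial eigendecomposition keeps the cost close to the $O(tn^2)$ noted above.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def empirical_pep(K, t):
    """t-empirical PEP: (sum of the t largest eigenvalues of K) / trace(K)."""
    # which='LA' asks for the t algebraically largest eigenvalues (requires t < n)
    top_eigs = eigsh(K, k=t, which='LA', return_eigenvectors=False)
    return float(np.sum(top_eigs) / np.trace(K))

# toy usage: PEP of a Gaussian kernel matrix (1/n-normalized, as in the paper)
X = np.random.randn(50, 4)
sq = np.sum(X ** 2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T)) / X.shape[0]
print(empirical_pep(K, t=4))
```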

Generalization Bounds with Empirical PEP

Since it is not easy to compute the value of the PEP, we have to use the empirical PEP for practical kernel learning. Fortunately, we can also use it to derive sharp bounds.

Theorem 2. Assume $\ell$ is an $L$-Lipschitz loss function. Then, with probability $1 - \delta$, for all $k \ge 1$ and $f \in \mathcal{H}_K^{\text{infinite}}$,
$$R(f) \le \frac{k}{k-1}\hat{R}(f) + c_3\frac{k\int_{\Omega}(1 - \hat{\beta}(K_\theta, t))\,dp(\theta)}{n} + c_4\frac{k}{n},$$
where $c_3 = 40\Delta^2 L^2\kappa$ and $c_4 = (14 + 2M)\log\frac{1}{\delta} + 40kL^2 t$.

One can see that the convergence rate can also reach $O(\log(n)/n)$ when setting $k = \log(n)$, which indicates the effectiveness of the empirical PEP for estimating the generalization error.

Corollary 2. Assume $\ell$ is an $L$-Lipschitz loss function. Then, with probability $1 - \delta$, for all $k \ge 1$ and $f \in \mathcal{H}_K^{\text{finite}}$, we have
$$R(f) \le \frac{k}{k-1}\hat{R}(f) + c_3\frac{k\sum_{i=1}^{m}d_i(1 - \hat{\beta}(K_{\theta_i}, t))}{n} + c_4\frac{k}{n},$$
where $c_3 = 40\Delta^2 L^2\kappa$ and $c_4 = (14 + 2M)\log\frac{1}{\delta} + 40kL^2 t$.

Kernel Learning Algorithms

In this section, we exploit the empirical PEP to devise kernel learning algorithms for finite and infinite kernels.

Finite Kernel Learning

This subsection studies the finite case. In this case, the final kernel $K_{\mathbf{d}}$ can be written as
$$K_{\mathbf{d}} = \sum_{i=1}^{m}d_i K_{\theta_i}, \quad d_i \ge 0, \quad \sum_{i=1}^{m}d_i = 1,$$
where $K_{\theta_i}$, $i = 1, \ldots, m$, are the basic kernels. According to the theoretical analysis of the above section, to guarantee generalization performance, we can formulate the following optimization:
$$\min_{\mathbf{d}}\ P(\mathbf{d}) = \min_{w, b}\ \frac{1}{2}w^{\top}w + C\sum_{i=1}^{n}\ell(f(x_i), y_i) - \lambda\sum_{i=1}^{m}d_i\hat{\beta}_i(t), \qquad (3)$$
$$\text{s.t.}\quad \mathbf{d} \ge 0,\quad \|\mathbf{d}\|_1 = 1,$$
where $\hat{\beta}_i(t)$ is the $t$-empirical PEP of $K_{\theta_i}$, $C$ and $\lambda$ are two trade-off parameters that control the balance between the empirical loss and the empirical PEP, and $f(x) = \langle w, \Phi_{\mathbf{d}}(x)\rangle + b$ with $K_{\mathbf{d}}(x, x') = \langle\Phi_{\mathbf{d}}(x), \Phi_{\mathbf{d}}(x')\rangle$.

To solve the optimization (3), we need to calculate the gradient $\nabla P(\mathbf{d})$. This can be achieved by moving to the dual formulation of $P(\mathbf{d})$, given by (for classification and regression respectively)
$$D_C(\mathbf{d}) = \max_{\alpha}\ \mathbf{1}^{\top}\alpha - \frac{1}{2}\alpha^{\top}Y\mathbf{K}_{\mathbf{d}}Y\alpha - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t), \quad \text{s.t.}\ \mathbf{1}^{\top}Y\alpha = 0,\ 0 \le \alpha \le C,$$
and
$$D_R(\mathbf{d}) = \max_{\alpha}\ \mathbf{1}^{\top}Y\alpha - \frac{1}{2}\alpha^{\top}\mathbf{K}_{\mathbf{d}}\alpha - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t) - \epsilon\mathbf{1}^{\top}|\alpha|, \quad \text{s.t.}\ \mathbf{1}^{\top}\alpha = 0,\ 0 \le |\alpha| \le C,$$
where $\mathbf{K}_{\mathbf{d}} = \sum_{i=1}^{m}d_i\mathbf{K}_{\theta_i}$ is the kernel matrix for a given $\mathbf{d}$, $Y = \mathrm{diag}(y_1, \ldots, y_n)$, and $\hat{\boldsymbol{\beta}}(t) = (\hat{\beta}_1(t), \ldots, \hat{\beta}_m(t))^{\top}$.

Note that we can write $P = E - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t)$ and $D = W - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t)$ with strong duality holding between $E$ and $W$. Thus, given any value of $\mathbf{d}$, from Lemma 2 in (Chapelle et al. 2002), we have
$$\frac{\partial P}{\partial d_k} = \frac{\partial D}{\partial d_k} = -\lambda\hat{\beta}_k(t) - \frac{1}{2}\alpha^{*\top}H\alpha^{*},$$
where $H = Y\mathbf{K}_{\theta_k}Y$ for classification, $H = \mathbf{K}_{\theta_k}$ for regression, and $\alpha^{*}$ is the optimal solution of the dual optimization $D$. Thus, all we need to take a gradient step is to obtain $\alpha^{*}$. If $\mathbf{d}$ is fixed, $P$ or $D$ is equivalent to the corresponding single-kernel problem with kernel matrix $\mathbf{K}_{\mathbf{d}}$. The PEP-based finite kernel learning algorithm (FKL) is summarized in Algorithm 1. The step size $\eta$ is chosen based on the Armijo rule to guarantee convergence, and the projection step for the constraints $\mathbf{d} \ge 0$ and $\|\mathbf{d}\|_1 = 1$ is as simple as $\mathbf{d} \leftarrow \max(0, \mathbf{d})$, $\mathbf{d} \leftarrow \mathbf{d}/\|\mathbf{d}\|_1$. From a convergence analysis similar to that of (Rakotomamonjy et al. 2008), it is easy to verify that our FKL algorithm is convergent.

Algorithm 1 Finite Kernel Learning (FKL)
1: Initialize: basic kernels $\{K_{\theta_i}\}_{i=1}^{m}$, $d_i^0 = \frac{1}{m}$, $\hat{\beta}_i(t) = \sum_{j=1}^{t}\lambda_j(\mathbf{K}_{\theta_i})/\mathrm{tr}(\mathbf{K}_{\theta_i})$, $i = 1, \ldots, m$, $s = 0$.
2: repeat
3:   $\mathbf{K}_{\mathbf{d}^s} = \sum_{i=1}^{m}d_i^s\mathbf{K}_{\theta_i}$.
4:   Use an SVM solver to solve the single-kernel problem with $\mathbf{K}_{\mathbf{d}^s}$ and obtain the optimal solution $\alpha^s$.
5:   $d_k^{s+1} \leftarrow d_k^s + \eta\big(\lambda\hat{\beta}_k(t) + \frac{1}{2}\alpha^{s\top}H\alpha^s\big)$, $k = 1, \ldots, m$.
6:   $\mathbf{d}^{s+1} \leftarrow \max(0, \mathbf{d}^{s+1})$, $\mathbf{d}^{s+1} = \mathbf{d}^{s+1}/\|\mathbf{d}^{s+1}\|_1$.
7:   $s \leftarrow s + 1$.
8: until converged

The time complexity of the FKL algorithm is $O(tmn^2 + kmn^2 + kA)$, where $m$ is the number of basic kernels, $n$ is the size of the data set, $A$ is the time complexity of training a single-kernel problem, and $k$ is the number of iterations of FKL. In our experiments, we find that our FKL algorithm converges very quickly.
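The following sketch (ours, not the authors' code) illustrates one iteration of Algorithm 1 for the classification case: the reduced gradient $\partial D/\partial d_k = -\lambda\hat{\beta}_k(t) - \frac{1}{2}\alpha^{\top}H\alpha$ is followed by the projection $\mathbf{d} \leftarrow \max(0, \mathbf{d})$, $\mathbf{d} \leftarrow \mathbf{d}/\|\mathbf{d}\|_1$. The SVM dual vector $\alpha$ is assumed to come from an external solver, and a fixed step size stands in for the Armijo rule.

```python
import numpy as np

def fkl_step(d, basic_kernels, beta_hat, y, alpha, lam, eta):
    """One projected-gradient update of the weights d (Algorithm 1, classification case).

    basic_kernels: list of the m kernel matrices K_{theta_i}
    beta_hat:      array of the m t-empirical PEP values
    alpha:         dual solution of the single-kernel SVM with kernel sum_i d_i K_{theta_i}
    """
    grad = np.empty_like(d)
    for k, K_k in enumerate(basic_kernels):
        H = (y[:, None] * K_k) * y[None, :]      # H = Y K_{theta_k} Y
        grad[k] = lam * beta_hat[k] + 0.5 * alpha @ H @ alpha
    d = np.maximum(d + eta * grad, 0.0)          # step in the direction -dP/dd, then d <- max(0, d)
    s = np.sum(d)
    return d / s if s > 0 else np.full_like(d, 1.0 / len(d))   # d <- d / ||d||_1
```

In a complete implementation, $\alpha$ would be recomputed by the SVM solver on $\mathbf{K}_{\mathbf{d}}$ at every iteration, as in lines 3 and 4 of Algorithm 1.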
Infinite Kernel Learning

In this subsection, we consider the infinite case. The objective we are interested in is the best possible finite kernel learning objective:
$$\inf_{\Omega^{\text{finite}} \subset \Omega}\ \min_{\mathbf{d}, w, b}\ \frac{1}{2}w^{\top}w + C\sum_{i=1}^{n}\ell(f(x_i), y_i) - \lambda\mathbf{d}^{\top}\hat{\boldsymbol{\beta}}(t), \qquad (4)$$
$$\text{s.t.}\quad \sum_{\theta \in \Omega^{\text{finite}}}d_\theta = 1, \quad d_\theta \ge 0,\ \forall\theta \in \Omega^{\text{finite}},$$

where $\Omega^{\text{finite}}$ is a finite set and $\Omega$ is a continuously parameterized set. The inner minimization problem is a finite kernel learning problem; from (Sonnenburg, Rätsch, and Schäfer 2006; Sonnenburg et al. 2006), the dual problem of optimization (4) can be written as
$$\sup_{\Omega^{\text{finite}} \subset \Omega}\ \max_{\alpha, \xi}\ h(\alpha) - \xi,$$
$$\text{s.t.}\quad \xi \in \mathbb{R},\ 0 \le q(\alpha_i) \le C, \qquad (5)$$
$$Q(\theta; \alpha) \le \xi,\ \forall\theta \in \Omega^{\text{finite}}.$$
For classification, $h(\alpha) = \mathbf{1}^{\top}\alpha$, $q(\alpha_i) = \alpha_i$, $Q(\theta; \alpha) = \lambda\hat{\beta}_\theta(t) + \frac{1}{2}\alpha^{\top}Y\mathbf{K}_\theta Y\alpha$; for regression, $h(\alpha) = \mathbf{1}^{\top}Y\alpha - \epsilon\mathbf{1}^{\top}|\alpha|$, $q(\alpha_i) = |\alpha_i|$, $Q(\theta; \alpha) = \lambda\hat{\beta}_\theta(t) + \frac{1}{2}\alpha^{\top}\mathbf{K}_\theta\alpha$, where $\hat{\beta}_\theta(t) = \sum_{j=1}^{t}\lambda_j(\mathbf{K}_\theta)/\mathrm{tr}(\mathbf{K}_\theta)$. Note that if some point $(\alpha^{*}, \xi^{*})$ satisfies the last condition of optimization (5) for all $\theta \in \Omega$, then it also satisfies the condition for all finite subsets thereof. Thus we omit the supremum of optimization (5) and extend the program to the following semi-infinite program:
$$\max_{\alpha, \xi}\ h(\alpha) - \xi,$$
$$\text{s.t.}\quad \xi \in \mathbb{R},\ 0 \le q(\alpha_i) \le C, \qquad (6)$$
$$Q(\theta; \alpha) \le \xi,\ \forall\theta \in \Omega.$$

The dual form (6) suggests a delayed constraint generation approach. We start with a finite constraint set $\Omega^0 \subset \Omega$ and search for the optimal $\alpha^{*}$. Then we find a new $\theta$, $\theta \leftarrow \arg\max_{\theta \in \Omega}Q(\theta, \alpha^{*})$, and add it into the set $\Omega^0$. The violated constraints are subsequently included in $\Omega^s \subset \Omega^{s+1} \subset \Omega$. The PEP-based infinite kernel learning algorithm (IFKL) is summarized in Algorithm 2. One can see that the basic kernels are selected automatically during the learning phase.

Algorithm 2 Infinite Kernel Learning (IFKL)
1: Initialize: continuously parameterized set $\Omega$, randomly select $\theta^0 \in \Omega$, $\Omega^0 = \emptyset$, $s = 0$, $\alpha^0 = 0$, $\xi = -\infty$.
2: while $Q(\theta^s; \alpha^s) > \xi$ do
3:   $\Omega^{s+1} = \Omega^s \cup \{\theta^s\}$.
4:   Use FKL (Algorithm 1) with $\Omega^{s+1}$ as the basic kernels to obtain the optimal solution $\alpha^{s+1}$.
5:   $\xi = \max_{\theta \in \Omega^{s+1}}Q(\theta, \alpha^{s+1})$.
6:   $\theta^{s+1} \leftarrow \arg\max_{\theta \in \Omega}Q(\theta, \alpha^{s+1})$ {Sub-Problem, solved in Algorithm 3}.
7:   $s \leftarrow s + 1$.
8: end while

From Theorem 7.2 in (Hettich and Kortanek 1993), we know that if the sub-problem (line 6 in Algorithm 2) can be solved, Algorithm 2 stops after a finite number of iterations or has at least one point of accumulation, and each of these points solves optimization (6).

To run Algorithm 2, we should solve the sub-problem (line 6 of Algorithm 2):
$$\arg\max_{\theta \in \Omega}\ Q(\theta; \alpha) = \lambda\hat{\beta}_\theta(t) + \frac{1}{2}\alpha^{\top}H\alpha, \qquad (7)$$
where $H = Y\mathbf{K}_\theta Y$ for classification and $H = \mathbf{K}_\theta$ for regression. We want to use a gradient-based algorithm to search for the maxima of $Q(\theta, \alpha)$ over $\Omega$. Thus, we should compute $\frac{\partial\hat{\beta}_\theta(t)}{\partial\theta}$. However, $\hat{\beta}_\theta(t)$ is not differentiable.

In the following, we show how to obtain an approximate solution of $\hat{\beta}_\theta(t)$. For any $\theta \in \Omega$, let $\mu_j^s$ be the $j$-th eigenvector of $\mathbf{K}_{\theta^s}$, $j = 1, \ldots, t$; then
$$\hat{\beta}_\theta(t) = \frac{\sum_{j=1}^{t}\lambda_j(\mathbf{K}_\theta)}{\mathrm{tr}(\mathbf{K}_\theta)} \ge \frac{\sum_{j=1}^{t}\mu_j^{s\top}\mathbf{K}_\theta\mu_j^s}{\mathrm{tr}(\mathbf{K}_\theta)} =: \tilde{\beta}_\theta(t, \theta^s).$$
Thus, we can use $\arg\max_{\theta \in \Omega}\tilde{Q}(\theta; \alpha) = \lambda\tilde{\beta}_\theta(t, \theta^s) + \frac{1}{2}\alpha^{\top}H\alpha$ to approximate the optimization (7). Note that
$$\frac{\partial\tilde{Q}(\theta, \alpha)}{\partial\theta} = \lambda\frac{\partial\tilde{\beta}_\theta(t, \theta^s)}{\partial\theta} + \frac{1}{2}\alpha^{\top}\frac{\partial H}{\partial\theta}\alpha, \qquad (8)$$
where $\frac{\partial\tilde{\beta}_\theta(t, \theta^s)}{\partial\theta} = \frac{\sum_{j=1}^{t}\mu_j^{s\top}\frac{\partial\mathbf{K}_\theta}{\partial\theta}\mu_j^s}{\mathrm{tr}(\mathbf{K}_\theta)} - \frac{\tilde{\beta}_\theta(t, \theta^s)}{\mathrm{tr}(\mathbf{K}_\theta)}\frac{\partial\,\mathrm{tr}(\mathbf{K}_\theta)}{\partial\theta}$. Thus, we can apply a gradient-based algorithm to solve the sub-problem, which is summarized in Algorithm 3.

Algorithm 3 Sub-Problem
1: Initialize: randomly select any $\theta^0 \in \Omega$, $s = 0$.
2: repeat
3:   Compute the eigenvectors $\mu_1^s, \ldots, \mu_t^s$ of $\mathbf{K}_{\theta^s}$.
4:   $\theta^{s+1} \leftarrow \theta^s + \eta\big(\lambda\frac{\partial\tilde{\beta}_\theta(t, \theta^s)}{\partial\theta} + \frac{1}{2}\alpha^{\top}\frac{\partial H}{\partial\theta}\alpha\big)$ (see Equation (8)).
5:   $s \leftarrow s + 1$.
6: until converged

The time complexity of IFKL is $O(k(tn^2 + sn^2 + A))$, where $k$ is the number of iterations of IFKL (Algorithm 2), $s$ is the number of iterations of Algorithm 3, and $A$ is the time complexity of FKL (Algorithm 1). In our experiments, we find that IFKL converges very quickly, and $k$ is smaller than 10 on all data sets.
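As a rough illustration of Algorithm 3, the sketch below (ours, under stated assumptions: a one-dimensional Gaussian parameterization $K_\theta(x, x') = \exp(-\theta\|x - x'\|^2)$ so that $\partial\mathbf{K}_\theta/\partial\theta$ is the elementwise product of $-\|x_i - x_j\|^2$ with $\mathbf{K}_\theta$, the classification case $H = Y\mathbf{K}_\theta Y$, and a fixed step size and iteration count instead of a convergence test) performs gradient ascent on the smoothed sub-problem objective using Equation (8).

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def subproblem_ascent(theta0, X, y, alpha, lam, t, eta=0.1, iters=50):
    """Gradient ascent on the surrogate objective Q~(theta; alpha) of the sub-problem (7)-(8)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)   # squared distances
    theta = theta0
    for _ in range(iters):
        K = np.exp(-theta * D2)
        dK = -D2 * K                                 # dK/dtheta for the Gaussian parameterization
        _, U = eigsh(K, k=t, which='LA')             # top-t eigenvectors of K at the current theta
        tr, dtr = np.trace(K), np.trace(dK)
        beta_tilde = np.sum(U * (K @ U)) / tr        # surrogate beta~(t, theta^s)
        dbeta = np.sum(U * (dK @ U)) / tr - beta_tilde * dtr / tr
        dH = (y[:, None] * dK) * y[None, :]          # dH/dtheta = Y (dK/dtheta) Y
        theta += eta * (lam * dbeta + 0.5 * alpha @ dH @ alpha)   # Equation (8)
    return theta
```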

Experiments

In this section, we empirically compare our PEP-based finite kernel learning (FKL) and infinite kernel learning (IFKL) with 9 popular finite and infinite kernel learning methods: centered-alignment based MKL with linear combination (CABMKL (linear)) and conic combination (CABMKL (conic)) (Cortes, Mohri, and Rostamizadeh 2010), SimpleMKL (Rakotomamonjy et al. 2008), the generalized MKL algorithm (GMKL) (Varma and Babu 2009), the nonlinear MKL algorithm with L1-norm (NLMKL (p = 1)) and L2-norm (NLMKL (p = 2)) (Cortes, Mohri, and Rostamizadeh 2009), group Lasso-based MKL algorithms with L1-norm (GLMKL (p = 1)) and L2-norm (GLMKL (p = 2)) (Kloft et al. 2011), and the state-of-the-art infinite kernel learning (IKL) (Gehler and Nowozin 2008).

The data sets are 10 publicly available data sets from LIBSVM Data, listed in Table 1. For finite kernels, we use the Gaussian kernel $K(x, x') = \exp(-\tau\|x - x'\|^2/2)$ and the polynomial kernel $K(x, x') = (1 + x^{\top}x')^d$ as our basic kernels, with $\tau \in \{2^i, i = -10, -9, \ldots, 10\}$ and $d \in \{1, 2, \ldots, 20\}$. For infinite kernels, we use the Gaussian kernel and the polynomial kernel with continuously parameterized sets, $\tau \in [2^{-10}, 2^{10}]$ and $d \in [1, 20]$. The regularization parameter $C$ of all algorithms is set to 1. The other parameters of the compared algorithms follow the experimental settings in their papers. The parameters $\lambda \in \{2^i, i = -5, \ldots, 5\}$ and $t \in \{2^i, i = 1, \ldots, 4\}$ of our algorithms are determined by 3-fold cross-validation on the training set. For each data set, we run all methods 30 times with a randomly selected 50% of all data for training and the other 50% for testing. We use the t-test to describe the statistical discrepancy of different methods. All statements of statistical significance in the remainder refer to a 95% level of significance.

Table 1: The average test accuracies of our IFKL and FKL, and the other methods: CABMKL (linear), CABMKL (conic), SimpleMKL, GMKL, GLMKL (p = 1), GLMKL (p = 2), NLMKL (p = 1), NLMKL (p = 2) and IKL. We bold the numbers of the best method, and underline the numbers of the other methods which are not significantly worse than the best one.

Datasets         IFKL (ours)  FKL (ours)   CABMKL(linear)  CABMKL(conic)  SimpleMKL    GMKL
australian       85.87±0.88   85.85±0.84   84.41±0.86      85.69±0.92     85.39±0.92   85.40±0.93
a2a              81.14±0.07   80.87±0.08   80.12±0.09      74.42±0.07     80.53±0.07   80.56±0.07
diabetes         76.25±0.99   76.19±0.94   75.19±1.40      75.55±0.89     75.71±0.85   75.71±0.85
german.numer     74.88±1.08   74.32±1.31   71.94±1.69      73.77±1.38     73.99±1.47   73.99±1.48
heart            82.27±2.92   82.15±5.50   82.02±3.85      82.05±5.47     81.53±3.31   82.12±3.81
ionosphere       93.67±1.07   93.66±1.02   91.57±1.49      91.59±1.43     91.53±1.26   91.52±1.24
liver-disorders  71.25±3.83   70.52±5.95   68.50±6.08      69.04±6.32     66.92±6.64   70.52±5.95
sonar            84.90±6.08   82.53±5.35   81.99±5.94      81.99±5.94     82.15±5.09   82.15±5.09
splice           85.48±0.76   85.46±0.82   84.20±0.90      84.20±0.90     84.31±0.85   84.55±0.81
svmguide3        82.96±0.71   82.83±0.72   80.76±0.78      80.34±0.77     82.81±0.71   82.81±0.71

Datasets         NLMKL(p=1)   NLMKL(p=2)   GLMKL(p=1)      GLMKL(p=2)     IKL
australian       85.35±0.90   84.70±0.94   85.46±0.88      85.39±0.92     85.78±0.85
a2a              79.16±0.07   79.07±0.08   80.71±0.07      80.82±0.08     80.84±0.08
diabetes         74.57±0.80   73.94±1.00   75.81±0.84      75.77±0.78     76.02±1.00
german.numer     73.55±1.04   72.45±1.22   74.07±1.53      73.77±1.38     74.83±1.02
heart            81.51±3.00   80.17±5.45   82.10±3.63      82.07±3.69     82.12±3.81
ionosphere       91.53±1.66   91.65±1.54   91.50±1.37      91.08±0.98     93.63±1.05
liver-disorders  68.61±5.47   68.55±4.30   64.57±3.43      66.97±6.24     70.58±5.29
sonar            81.25±6.44   81.89±6.27   82.24±5.13      81.86±6.43     83.04±6.50
splice           83.45±1.14   83.69±0.99   85.40±0.79      53.29±1.05     85.46±0.82
svmguide3        82.19±0.71   82.53±0.68   82.75±0.72      80.23±1.29     82.83±0.72

The average test accuracies are reported in Table 1, which can be summarized as follows: 1) Our IFKL gives the best results on all data sets and is significantly better than the compared finite kernel selection methods (CABMKL, SimpleMKL, GMKL and NLMKL) on nearly all data sets. This indicates that learning the kernel with PEP over a prescribed set of continuously parameterized basic kernels can guarantee good generalization performance. 2) IFKL is significantly better than the state-of-the-art infinite kernel learning (IKL) on 4 out of 10 data sets, and FKL is significantly better than the compared finite kernel learning methods (CABMKL, SimpleMKL, GMKL and NLMKL) on most data sets. 3) FKL gives results comparable to IFKL. 4) The infinite kernel learning methods (IFKL and IKL) are usually better than the compared finite kernel learning methods (CABMKL, SimpleMKL, GMKL and NLMKL), which manifests the effectiveness of using infinite kernels. The above results show that the use of PEP can significantly improve the performance of kernel learning algorithms for both finite and infinite kernels.

Conclusion

In this paper, we have created a new kernel learning method based on the principal eigenvalue proportion (PEP) measure of generalization error, which can learn the optimal kernel with a fast convergence rate over the convex hull of a possibly infinite set of basic kernels. Using the PEP measure, we have derived sharper generalization bounds with fast convergence rates, which improve the generalization bounds of most existing kernel learning algorithms. Exploiting the derived generalization bounds, we have designed two effective kernel learning algorithms with statistical guarantees and fast convergence rates, which outperform the state-of-the-art kernel learning methods.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China under grants No. 61602467 and No. 61303056, and by the Youth Innovation Promotion Association, CAS, No. 2016146.

References

Anguita, D.; Ghio, A.; Oneto, L.; and Ridella, S. 2012. In-sample and out-of-sample model selection and error estimation for support vector machines. IEEE Transactions on Neural Networks and Learning Systems 23(9):1390–1406.
Argyriou, A.; Hauser, R.; Micchelli, C. A.; and Pontil, M. 2006. A DC-programming algorithm for kernel selection. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), 41–48. ACM.
Argyriou, A.; Micchelli, C. A.; and Pontil, M. 2005. Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the 18th Annual Conference on Learning Theory (COLT 2005), 338–352. Springer.
Bach, F.; Lanckriet, G.; and Jordan, M. 2004. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), 41–48.
Bartlett, P. L., and Mendelson, S. 2002. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3:463–482.
Bartlett, P. L.; Bousquet, O.; and Mendelson, S. 2005. Local Rademacher complexities. The Annals of Statistics 33(4):1497–1537.
Bousquet, O., and Elisseeff, A. 2002. Stability and generalization. Journal of Machine Learning Research 2:499–526.
Chapelle, O.; Vapnik, V.; Bousquet, O.; and Mukherjee, S. 2002. Choosing multiple parameters for support vector machines. Machine Learning 46(1-3):131–159.
Cortes, C.; Kloft, M.; and Mohri, M. 2013. Learning kernels using local Rademacher complexity. In Advances in Neural Information Processing Systems 25 (NIPS 2013), 2760–2768. MIT Press.
Cortes, C.; Mohri, M.; and Rostamizadeh, A. 2009. Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems 22 (NIPS 2009), 396–404.
Cortes, C.; Mohri, M.; and Rostamizadeh, A. 2010. Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 239–246.
Gehler, P., and Nowozin, S. 2008. Infinite kernel learning. In NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels.
Ghiasi-Shirazi, K.; Safabakhsh, R.; and Shamsi, M. 2010. Learning translation invariant kernels for classification. Journal of Machine Learning Research 11:1353–1390.
Hettich, R., and Kortanek, K. O. 1993. Semi-infinite programming: theory, methods, and applications. SIAM Review 35(3):380–429.
Kloft, M., and Blanchard, G. 2012. On the convergence rate of ℓp-norm multiple kernel learning. Journal of Machine Learning Research 13(1):2465–2502.
Kloft, M.; Brefeld, U.; Sonnenburg, S.; Laskov, P.; Müller, K.-R.; and Zien, A. 2009. Efficient and accurate ℓp-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22 (NIPS 2009), 997–1005.
Kloft, M.; Brefeld, U.; Sonnenburg, S.; and Zien, A. 2011. ℓp-norm multiple kernel learning. Journal of Machine Learning Research 12:953–997.
Koltchinskii, V., and Panchenko, D. 2000. Rademacher processes and bounding the risk of function learning. Springer.
Koltchinskii, V. 2006. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34(6):2593–2656.
Lanckriet, G. R. G.; Cristianini, N.; Bartlett, P. L.; Ghaoui, L. E.; and Jordan, M. I. 2004. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5:27–72.
Liu, Y., and Liao, S. 2015. Eigenvalues ratio for kernel selection of kernel methods. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI 2015).
Liu, Y.; Jiang, S.; and Liao, S. 2014. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 324–332.
Liu, Y.; Liao, S.; and Hou, Y. 2011. Learning kernels with upper bounds of leave-one-out error. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), 2205–2208.
Lugosi, G., and Wegkamp, M. 2004. Complexity regularization via localized random penalties. The Annals of Statistics 32:1679–1697.
Mendelson, S. 2003. On the performance of kernel classes. Journal of Machine Learning Research 4:759–771.
Micchelli, C. A., and Pontil, M. 2005. Learning the kernel function via regularization. Journal of Machine Learning Research 6:1099–1125.
Ong, C. S.; Smola, A. J.; and Williamson, R. C. 2005. Learning the kernel with hyperkernels. Journal of Machine Learning Research 6:1043–1071.
Rakotomamonjy, A.; Bach, F.; Canu, S.; and Grandvalet, Y. 2008. SimpleMKL. Journal of Machine Learning Research 9:2491–2521.
Sonnenburg, S.; Rätsch, G.; Schäfer, C.; and Schölkopf, B. 2006. Large scale multiple kernel learning. Journal of Machine Learning Research 7:1531–1565.
Sonnenburg, S.; Rätsch, G.; and Schäfer, C. 2006. A general and efficient multiple kernel learning algorithm. In Advances in Neural Information Processing Systems 18 (NIPS 2006).
Srebro, N.; Sridharan, K.; and Tewari, A. 2010. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 22, 2199–2207. MIT Press.
Vapnik, V. 2000. The Nature of Statistical Learning Theory. Springer Verlag.
Varma, M., and Babu, B. R. 2009. More generality in efficient multiple kernel learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), 134–141.
Zhang, T. 2002. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research 2:527–550.