Two-Layer Multiple Kernel Learning

Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
School of Computer Engineering, Nanyang Technological University, Singapore
[email protected]  [email protected]  [email protected]

Abstract

Multiple Kernel Learning (MKL) aims to learn kernel machines for solving a real machine learning problem (e.g., classification) by exploring the combinations of multiple kernels. The traditional MKL approach is in general "shallow" in the sense that the target kernel is simply a linear (or convex) combination of some base kernels. In this paper, we investigate a framework of Multi-Layer Multiple Kernel Learning (MLMKL) that aims to learn "deep" kernel machines by exploring the combinations of multiple kernels in a multi-layer structure, which goes beyond the conventional MKL approach. Through a multiple-layer mapping, the proposed MLMKL framework offers higher flexibility than regular MKL for finding the optimal kernel for applications. As the first attempt at this new MKL framework, we present a Two-Layer Multiple Kernel Learning (2LMKL) method together with two efficient algorithms for classification tasks. We analyze their generalization performance and conduct an extensive set of experiments over 16 benchmark datasets, in which encouraging results show that our method performs better than conventional MKL methods.

1 Introduction

Kernel learning is one of the active research topics in the machine learning community. The family of kernel-based algorithms has been extensively studied over the past decade [23]. Some well-known examples include Support Vector Machines (SVM) [9, 27], Kernel Logistic Regression [32], and Kernel PCA for denoising [20, 17]. These kernel methods have been successfully applied to a variety of real applications and often achieve promising performance.

The most crucial element of a kernel method is the kernel, which is in general a function that defines an inner product between any two examples in some induced Hilbert space [10, 23, 15]. By mapping data from an input space to the reproducing kernel Hilbert space (RKHS) [10], which could be potentially high-dimensional, traditional linear methods can be extended with reasonable effort to yield considerably better performance. Many empirical studies have shown that the choice of kernel often affects the resulting performance of kernel methods significantly. In fact, inappropriate kernels can result in sub-optimal or very poor performance.

For many real-world situations, it is often not an easy task to choose an appropriate kernel function, as doing so usually requires domain knowledge that non-expert users may lack. To address this limitation, recent years have witnessed active research on learning effective kernels automatically from data [4]. One popular technique for kernel learning is Multiple Kernel Learning (MKL) [4, 24], which aims at learning a linear (or convex) combination of a set of predefined kernels in order to identify a good target kernel for the application. Compared with traditional kernel methods using a single fixed kernel, MKL exhibits the strengths of automated kernel parameter tuning and of combining heterogeneous data. Over the past few years, MKL has been actively investigated: a number of algorithms have been proposed to improve the efficiency of MKL [4, 25, 21, 29], and a number of extended MKL techniques have been proposed to improve the regular linear MKL method [11, 12, 3, 28, 16, 8].

Despite being studied actively, existing MKL methods unfortunately do not always produce considerably better empirical performance than a single kernel whose parameters were tuned by cross validation [11]. There are several possible reasons for such failure. One conjecture is that the target kernel domain K of MKL using a linear combination may not be rich enough to contain the optimal kernel. Therefore, some emerging studies have attempted to improve regular MKL by exploring a more general kernel domain K using some nonlinear combination [28]. Following a similar motivation, we speculate that the insignificant performance gain is probably due to the shallow learning nature of regular MKL, which simply adopts a flat (linear/nonlinear) combination of multiple kernels.

To this end, this paper presents a novel framework of Multi-Layer Multiple Kernel Learning (MLMKL), which applies the idea of deep learning to improve the MKL task. Deep architectures have been actively studied in the machine learning community and have shown promising performance for some applications [13, 14, 19, 5]. Our study was partially inspired by the recent work [6] that first explored kernel methods with the idea of deep learning. However, unlike the previous work, our study in this paper mainly aims to address the challenge of improving the existing MKL techniques with deep learning. Specifically, we introduce a multilayer architecture for MLMKL, in which all base kernels in antecedent layers are combined to form inputs to kernels in subsequent layers. We also provide an efficient alternating optimization algorithm to learn the decision function and the weights of base kernels simultaneously. To further minimize the domain knowledge required to design the choices and the numbers of base kernels in each antecedent layer, we also present an infinite base kernel learning algorithm for our proposed MLMKL framework.

The rest of this paper is organized as follows. Section 2 gives some preliminaries of multiple kernel learning and deep learning. Section 3 first presents the framework of MLMKL and then proposes a Two-Layer MKL method, followed by the development of two efficient algorithms and the analysis of their generalization performance. Section 4 discusses an extensive set of experiments for performance evaluation over a testbed with 16 publicly available benchmark data sets. Section 5 concludes this paper.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

2 Preliminaries

In this section, we introduce some preliminaries of MKL and some emerging studies of deep learning for kernel methods.

2.1 Multiple Kernel Learning

Consider a collection of n training samples X_1^n = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^d is the input feature vector and y_i is the class label of x_i. In general, the problem of conventional multiple kernel learning (MKL) can be formulated as the following optimization scheme [4]:

\[
\min_{k\in\mathcal{K}}\ \min_{f\in\mathcal{H}_k}\ \lambda\|f\|_{\mathcal{H}_k} + \sum_{i=1}^{n}\ell\big(y_i f(x_i)\big), \qquad (1)
\]

where ℓ(·) denotes some loss function, e.g. the hinge loss ℓ(t) = max(0, 1 − t) used for SVM, H_k is the reproducing kernel Hilbert space associated with kernel k, K denotes the optimization domain of the candidate kernels, and λ is a regularization parameter. The above optimization aims to simultaneously identify both the optimal kernel k from the domain K and the optimal prediction function f from the reproducing kernel Hilbert space H_k induced by the optimal kernel k. If the optimal kernel k is given a priori, the above formulation essentially reduces to the kernel SVM.

By the representer theorem [22], the decision function f(x) for the above formulation is in the form of a linear expansion of kernel evaluations on the training samples x_i:

\[
f(x) = \sum_{i=1}^{n}\beta_i k(x_i, x), \qquad (2)
\]

where the β_i are the coefficients.

In traditional MKL [18], K is chosen to be the set of convex combinations of predefined base kernels:
\[
\mathcal{K}_{\mathrm{conv}} = \Big\{ k(\cdot,\cdot) = \sum_{t=1}^{m}\mu_t k_t(\cdot,\cdot) \ :\ \sum_{t=1}^{m}\mu_t = 1,\ \mu_t \ge 0,\ t = 1,\ldots,m \Big\}, \qquad (3)
\]

where each candidate kernel k is some combination of the m base kernels {k_1, ..., k_m}, and µ_t is the coefficient of the t-th base kernel. From (3), we can expand the decision function in (2) with the multiple kernels:

∑n ∑m ∑n ∑m f(x) = βi µtkt(xi, x) = βiµtkt(xi, x), i=1 t=1 i=1 t=1 (4)

Although it has been studied extensively, the traditional MKL approach, similar to SVM [27], falls short in that the resulting kernel machine is "shallow": it often adopts a simple conic combination of multiple kernels and trains the classifier with the combined kernel in a "flat" architecture, which may not be powerful enough to fit the diverse patterns of some complicated real-world tasks.

2.2 Deep Learning and Multilayer Kernels

Recently, many machine learning studies have addressed one limitation of conventional learning techniques (such as SVM) regarding their shallow learning architectures. It has been shown that deep architectures, such as multilayer neural nets, are often preferable over shallow ones. Very recently, Cho and Saul [6, 7] first introduced the idea of deep learning to kernel methods, which can be applied either in deep architectures or in shallow structures, like SVM. An l-layer kernel is the inner product after multiple feature mappings of the inputs:

\[
k^{(l)}(x_i, x_j) = \Big\langle \underbrace{\Phi(\Phi(\ldots(\Phi}_{l \text{ times}}(x_i)))),\ \underbrace{\Phi(\Phi(\ldots(\Phi}_{l \text{ times}}(x_j)))) \Big\rangle,
\]

where Φ is the underlying feature mapping function of k and ⟨·, ·⟩ computes the inner product.

Specifically, we consider an example of a two-layer RBF kernel. An RBF kernel is typically defined as

\[
k(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2},
\]

where γ > 0 is the kernel parameter. Applying the idea of a two-layer kernel to the RBF kernel, the composition yields

\[
\big\langle \Phi(\Phi(x_i)), \Phi(\Phi(x_j)) \big\rangle
= e^{-\gamma\|\Phi(x_i)-\Phi(x_j)\|^2}
= e^{-2\gamma(1-k(x_i,x_j))}
= \kappa\, e^{2\gamma k(x_i,x_j)}, \qquad (5)
\]

where κ is a constant that can be omitted. The same idea can be applied to other types of kernels. In [6, 7], the authors provided a multiple-layer composition approach with respect to a special family of arc-cosine kernel functions.
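As a quick illustration of (5), the sketch below (ours, not part of the original paper) computes a two-layer RBF kernel matrix directly from the base RBF kernel, using the identity ‖Φ(x_i) − Φ(x_j)‖² = 2(1 − k(x_i, x_j)) for the unit-norm RBF feature map.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, Z, gamma):
    """Base RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * cdist(X, Z, "sqeuclidean"))

def two_layer_rbf_kernel(X, Z, gamma):
    """Two-layer RBF kernel <Phi(Phi(x)), Phi(Phi(z))> from Eqn. (5).

    Since ||Phi(x)||^2 = k(x, x) = 1, the squared distance in feature space is
    2 * (1 - k(x, z)), so the composed kernel equals exp(-2*gamma*(1 - k(x, z))),
    i.e. kappa * exp(2*gamma*k(x, z)) with kappa = exp(-2*gamma).
    """
    K1 = rbf_kernel(X, Z, gamma)
    return np.exp(-2.0 * gamma * (1.0 - K1))
```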

Remark. The work studied in [6] has some limitations. First, the proposed multi-layer kernel was applied to only a single type of kernel, typically some special kernel function, such as the arc-cosine kernel [6]. In a real application, a more desirable solution is to allow a combination of a variety of different kernels when designing the deep kernel. Second, the multi-layer kernel proposed in [6] is often "static", i.e., some fixed kernel (where the degree n and level l parameters were chosen manually). No solution has been provided to optimize the kernel by learning the optimal parameters automatically. Our work was partially motivated by this work to address the above limitations.

3 Multi-Layer Multiple Kernel Learning

In this section, we first introduce a general framework of Multi-Layer Multiple Kernel Learning (MLMKL), and then present an MLMKL paradigm, i.e., Two-Layer Multiple Kernel Learning.

3.1 Framework

Following the optimization framework of MKL, the basic idea of MLMKL is to relax the optimization domain K in the traditional MKL optimization by adopting a family of deep kernels. Specifically, we first define a domain of l-level multi-layer kernels as follows:

\[
\mathcal{K}^{(l)} = \Big\{ k^{(l)}(\cdot,\cdot) = g^{(l)}\big( [k_1^{(l-1)}(\cdot,\cdot), \ldots, k_m^{(l-1)}(\cdot,\cdot)] \big) \Big\},
\]

where g^{(l)} is some function that combines multiple (l−1)-level kernels and must ensure the resulting combination is a valid kernel. With this domain, in a way similar to regular MKL, we can formulate the optimization problem of l-level MLMKL as:

\[
\min_{k\in\mathcal{K}^{(l)}}\ \min_{f\in\mathcal{H}_k}\ \lambda\|f\|_{\mathcal{H}_k} + \sum_{i=1}^{n}\ell\big(y_i f(x_i)\big).
\]

To explain it intuitively, Figure 1 illustrates the architecture of an example three-layer MKL paradigm.

Figure 1: The architecture of the proposed deep multiple kernel learning framework. Shown is an example of three-layer MKL; some connections are not displayed to simplify the figure.

Despite sharing a similar optimization form, MLMKL is much more challenging than conventional shallow MKL. This is because there are many unknown structures and variables, including the initialization of base kernels, the unknown combination functions g^{(l)} at each level, and the final prediction model f. Apparently, it is not possible to fully optimize every aspect. In the following, we attempt to attack this challenge by considering a simplified paradigm, i.e., Two-Layer Multiple Kernel Learning (2LMKL).
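The following sketch (our own illustration, not from the paper) shows one concrete way to realize the nested domain K^{(l)}: kernel matrices at each level are combined by an exponentiated weighted sum, the same choice of g used for the two-layer case below. The helper name and the single-kernel-per-layer simplification are assumptions.

```python
import numpy as np

def compose_multilayer_kernel(level_one_kernels, layer_weights):
    """Build an l-level kernel matrix K^(l) = g(K_1^(l-1), ..., K_m^(l-1)).

    level_one_kernels: list of (n, n) base kernel matrices (level 1).
    layer_weights:     list of weight vectors, one per additional layer; each
                       layer combines the previous layer's kernels via
                       g(K_1, ..., K_m) = exp(sum_t mu_t * K_t).
    For simplicity this sketch produces a single combined kernel per layer.
    """
    kernels = level_one_kernels
    for mu in layer_weights:
        combined = sum(w * K for w, K in zip(mu, kernels))
        kernels = [np.exp(combined)]  # feeds the next layer
    return kernels[0]
```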


3.2 Two-Layer Multiple Kernel Learning

To simplify the notation, we restrict our discussion to a Two-Layer Multiple Kernel Learning task in this section; our algorithm can also be extended to general multi-layer MKL. Further, we employ an RBF kernel for the combination function g^{(2)} and define the two-layer multiple kernel domain as follows:

\[
\mathcal{K}^{(2)} = \Big\{ k^{(2)}(x_i, x_j; \mu) = \exp\Big(\sum_{t=1}^{m}\mu_t k_t^{(1)}(x_i, x_j)\Big) \ :\ \mu \in \mathbb{R}_{+}^{m} \Big\}, \qquad (6)
\]

where µ_t denotes the weight of the t-th antecedent-layer kernel. Thus we can formulate the two-layer MKL with the kernel domain K^{(2)} as:

\[
\min_{k\in\mathcal{K}^{(2)}}\ \min_{f\in\mathcal{H}_k}\ \frac{1}{2}\|f\|_{\mathcal{H}_k}^{2} + C\sum_{i=1}^{n}\max\big(0,\ 1 - y_i f(x_i)\big) + \sum_{t=1}^{m}\mu_t.
\]

Note that the last term is introduced as a regularization to prevent the coefficients from becoming too large. We can further turn the above formulation into the following equivalent min-max optimization:

\[
\min_{\mu}\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j k^{(2)}(x_i, x_j; \mu) + \sum_{t=1}^{m}\mu_t
\quad \text{s.t.}\ 0\le\alpha_i\le C,\ \sum_{i=1}^{n}\alpha_i y_i = 0,\ \mu_t\ge 0,\ t=1,\ldots,m, \qquad (7)
\]

where α = [α_1, ..., α_n]^⊤ is the vector of dual variables and µ = [µ_1, ..., µ_m]^⊤. Once the above optimization is solved for α and µ, it is straightforward to obtain the final decision function of the Two-Layer MKL machine:

\[
f(x; \alpha, \mu) = \sum_{i=1}^{n}\alpha_i y_i k^{(2)}(x_i, x; \mu) + b, \qquad (8)
\]

where the bias term b can be easily determined from the KKT conditions. Due to the nonlinearity of the exp(·) function, we expect that the decision function in (8) can represent richer prediction tasks than that in (4).

The next challenge is how to solve the above optimization. We consider an alternating optimization scheme: (1) fix α and solve µ; and (2) fix µ and solve α. Specifically, let us denote by J(α, µ) the following function:

\[
J(\alpha, \mu) = \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j k^{(2)}(x_i, x_j; \mu) - \sum_{i=1}^{n}\alpha_i - \sum_{t=1}^{m}\mu_t.
\]

Since k^{(2)} is positive semi-definite, the objective is convex over α, so for a fixed µ it can be solved by standard QP solvers. However, for any pair (i, j) with y_i y_j = −1 we have α_iα_j y_i y_j k^{(2)}(x_i, x_j; µ) ≤ 0; thus J is non-convex over µ. We simply compute the d-th component of the gradient w.r.t. µ:

\[
\big[\nabla J_{\mu}\big]_d := \big[\nabla_{\mu} J(\alpha^{t}, \mu^{t-1})\big]_d
= \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, k^{(2)}(x_i, x_j; \mu^{t-1})\, k_d^{(1)}(x_i, x_j) - 1. \qquad (9)
\]

Then we update µ by gradient ascent:

\[
\mu^{t} = \max\big(\mu^{t-1} + \eta\,\nabla J_{\mu^{t-1}},\ 0\big).
\]

The step size η can be set by Armijo's rule such that convergence is guaranteed. Let Θ = {θ_1, ..., θ_m} denote the set of hyper-parameters corresponding to the base kernels k_t^{(1)} inside k^{(2)} in (6). Algorithm 1 shows the detailed optimization steps of the proposed two-layer MKL algorithm with the given Θ.
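For concreteness, here is a small NumPy sketch (ours, not the authors') of the µ-update used in the alternating scheme: it builds k^{(2)} from the base kernel matrices as in (6), evaluates the gradient components of (9), and applies the projected gradient-ascent step. Names and array shapes are assumptions.

```python
import numpy as np

def two_layer_kernel(base_K, mu):
    """k^(2)(x_i, x_j; mu) = exp(sum_t mu_t * k_t^(1)(x_i, x_j)).
    base_K: array of shape (m, n, n) stacking the base kernel matrices."""
    return np.exp(np.tensordot(mu, base_K, axes=1))

def grad_mu(alpha, y, base_K, mu):
    """Gradient components from (9):
    0.5 * sum_ij alpha_i alpha_j y_i y_j k^(2)(x_i, x_j; mu) k_d^(1)(x_i, x_j) - 1."""
    K2 = two_layer_kernel(base_K, mu)
    A = np.outer(alpha * y, alpha * y) * K2
    return 0.5 * np.einsum("ij,dij->d", A, base_K) - 1.0

def update_mu(alpha, y, base_K, mu, eta):
    """Projected gradient-ascent step: mu <- max(mu + eta * grad, 0)."""
    return np.maximum(mu + eta * grad_mu(alpha, y, base_K, mu), 0.0)
```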


Algorithm 1 Two-Layer MKL (2LMKL): (α, µ) = TwoLayerMKLwithTheta(Θ_0; X_1^n)
Input: Training sample X_1^n, initial set of base kernel parameters Θ_0 = {θ_1, ..., θ_m};
Output: weight vector µ of base kernels, dual variables α of SVM.
1: Randomly initialize µ^0, compute initial base kernels with Θ;
2: repeat
3:   Compute the current kernel matrix with µ^{t−1};
4:   α^t = arg min_α J(α, µ^{t−1}) by an SVM solver;
5:   Compute ∇_µ J(α^t, µ^{t−1}) by (9) as the ascent direction;
6:   Determine the step size η^t by Armijo's rule, update µ^t = max(µ^{t−1} + η^t ∇J_{µ^{t−1}}, 0);
7: until convergence

3.3 Improved Two-Layer MKL Algorithm with Infinite Base Kernel Learning

Notice that, similar to traditional MKL algorithms, our proposed MLMKL algorithm must also assume that a set of predefined base kernels inside k^{(2)}, as in K^{(2)}, is provided beforehand. If the number of base kernels is too small, k^{(2)} may not be flexible enough to fit the complicated patterns of a real problem. On the other hand, both the time cost and the space cost of MKL/MLMKL increase with the cardinality of K_conv/K^{(2)}, which may be computationally inefficient when too many base kernels are used. Moreover, though the proposed deep MKL architecture provides flexibility for the design of multi-layer kernels, determining an appropriate set of base kernels usually requires some domain knowledge, which may be difficult for non-expert users.

To partially address these challenges, here we propose to generate the base kernels inside k^{(2)} iteratively. This can be done by selecting a base kernel that optimizes the objective function in (7), which is similar to the idea of infinite kernel learning [2, 11] except that our base kernels are in the antecedent layer. Assume the inner base kernel is continuously parameterized by θ, for example, the bandwidth parameter of a Gaussian kernel or the degree of a polynomial kernel. To expand the base kernel set K, we choose a θ such that the resulting single kernel maximizes J with the current solution α:

\[
\max_{\theta\in\mathbb{R}_{+}}\ J(\alpha, \theta) = \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j \exp\big(k(x_i, x_j; \theta)\big). \qquad (10)
\]

Again, this problem is non-convex over θ. Similar to solving for µ, we compute the gradient of (10) w.r.t. θ and perform gradient ascent. For example, if the inner base kernel is a Gaussian kernel k(x_i, x_j; θ) = exp(−θ‖x_i − x_j‖²), the gradient can be computed as follows:

\[
\nabla J_{\theta} = \frac{1}{2}\sum_{i,j\in\mathbb{N}_n}\alpha_i\alpha_j y_i y_j\, e^{k(x_i,x_j;\theta)}\, k(x_i, x_j; \theta)\,\|x_i - x_j\|^{2}. \qquad (11)
\]

After that, we employ a line search approach to determine the step size for gradient ascent. The proposed improved two-layer MKL algorithm iterates between the following two steps: (1) iteratively solve the dual variables α and the kernel weights µ as in the previous algorithm; (2) add a new θ to Θ by the base kernel generation method. Algorithm 2 summarizes the details of the improved two-layer multiple kernel learning algorithm, which is denoted as 2LMKL^Inf for short.
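Below is a small sketch (our illustration, with hypothetical names) of the base-kernel generation step: it evaluates the objective in (10) for a Gaussian inner kernel and climbs it with gradient-ascent steps, using a central finite difference as a simple stand-in for the closed form in (11).

```python
import numpy as np
from scipy.spatial.distance import cdist

def generation_objective(alpha, y, sq_dists, theta):
    """J(alpha, theta) from (10) with a Gaussian inner kernel
    k(x_i, x_j; theta) = exp(-theta * ||x_i - x_j||^2)."""
    K = np.exp(-theta * sq_dists)
    return 0.5 * np.sum(np.outer(alpha * y, alpha * y) * np.exp(K))

def generate_new_theta(alpha, y, X, theta0=1.0, eta=0.1, max_iter=50, tol=1e-6):
    """Gradient-ascent search for a new base-kernel parameter theta,
    using a finite-difference gradient in place of Eqn. (11)."""
    sq_dists = cdist(X, X, "sqeuclidean")
    theta, prev = theta0, -np.inf
    for _ in range(max_iter):
        cur = generation_objective(alpha, y, sq_dists, theta)
        if cur - prev < tol:          # objective no longer improving
            break
        prev = cur
        eps = 1e-6
        grad = (generation_objective(alpha, y, sq_dists, theta + eps)
                - generation_objective(alpha, y, sq_dists, theta - eps)) / (2 * eps)
        theta = max(theta + eta * grad, 1e-12)  # keep theta positive
    return theta
```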

Algorithm 2 Infinite Two-Layer MKL (2LMKL^Inf): (α, µ, Θ) = TwoLayerMKL(Θ_0; X_1^n)
Input: Initial set of base kernel parameters Θ_0, training sample X_1^n;
Output: Final set of base kernel parameters Θ, weight vector µ of base kernels, dual variables α.
1: Initialize Θ = Θ_0;
2: while true do
3:   (α, µ) = TwoLayerMKLwithTheta(Θ; X_1^n);
4:   θ = NewKernel(α; X_1^n);
5:   if J(α, θ) ≤ J(α, µ) then
6:     break;
7:   end if
8:   Θ = Θ ∪ {θ};
9: end while
10:
11: Function θ = NewKernel(α; X_1^n):
12: Randomly initialize θ^0;
13: while J(α, θ^{t−1}) is improving do
14:   Compute the gradient ∇J_θ by an approach similar to Eqn. (11);
15:   Determine a step size η^t, update θ^t = θ^{t−1} + η^t ∇J_θ;
16: end while
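A minimal code-form sketch of Algorithm 2's outer loop (ours; the helper callables such as two_layer_mkl_with_theta, new_kernel, and the objective J are assumed to exist, e.g. as in the sketches above):

```python
def infinite_two_layer_mkl(theta0_set, X, y, two_layer_mkl_with_theta, new_kernel, J):
    """Alternate between fitting (alpha, mu) on the current base-kernel set and
    generating a new candidate parameter theta; stop once the candidate no longer
    improves the objective, mirroring steps 1-9 of Algorithm 2."""
    Theta = list(theta0_set)
    while True:
        alpha, mu = two_layer_mkl_with_theta(Theta, X, y)   # Algorithm 1
        theta = new_kernel(alpha, X, y)                     # maximize (10) over theta
        if J(alpha, theta) <= J(alpha, mu):                 # no improvement: stop
            break
        Theta.append(theta)
    return alpha, mu, Theta
```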


3.4 Analysis of Generalization Performance

We are aware of the trend of seeking new kernel combination methods beyond traditional MKL. However, the kernel encodes prior knowledge about the data, and its construction cannot be fully "automated". When we add more flexibility to kernel learning, we also potentially increase the difficulty of finding the optimal kernel. This calls for theoretical analysis of these generalized kernel combination methods. We base our analysis of two-layer MKL mainly on the notion of the pseudo-dimension of the kernel optimization domain K^{(2)} [26].

Theorem 1. [26] Let L(f) = P(f(x_i)y_i ≤ 0) be the generalization risk of some prediction function f learned by solving (1), and L_n(f) = (1/n) Σ_{i=1}^{n} 1(f(x_i)y_i < γ) be the empirical error. For a kernel family K with pseudo-dimension d_K, the generalization risk of f is bounded as

\[
L(f) \le L_n(f) + \tilde{O}\Big(\sqrt{\big(d_{\mathcal{K}} + 1/\gamma^2\big)/n}\Big),
\]

where γ is the margin in the loss function ℓ(t) = max(0, γ − t), and the Õ notation hides logarithmic factors in its argument, the sample size, and the allowed failure probability.

Here the pseudo-dimension d_K measures the richness or complexity of a kernel domain K:

Definition 1. Let K be a set of kernel functions mapping from X × X to R. We say that a set of paired examples S_n = {(x_i, x_i') ∈ X × X, i = 1, ..., n} is pseudo-shattered by K if there are real numbers r_1, ..., r_n such that for any b ∈ {−1, 1}^n there is a function k ∈ K with the property sgn(k(x_i, x_i') − r_i) = b_i for any (x_i, x_i') ∈ S_n. Then, we define the pseudo-dimension d_K of K to be the largest n such that there exists an S_n that can be pseudo-shattered by K.

Theorem 2. Let K^{(1)} = {k_1^{(1)}, ..., k_m^{(1)}} be the inner base kernel family for the two-layer deep kernel domain K^{(2)} defined in (6), where m is the number of base kernels. Assuming the evaluation of each k^{(1)} is always positive, the pseudo-dimension d_{K^{(2)}} is bounded by d_{K^{(2)}} ≤ m.

Proof. First, we re-write K^{(2)} as follows:

\[
\mathcal{K}^{(2)} = \Big\{ \prod_t \exp\big(\mu_t k_t^{(1)}\big) \ :\ k_t^{(1)} \in \mathcal{K}^{(1)} \Big\}.
\]

Thus each k^{(2)} ∈ K^{(2)} is a product of base kernels of the form exp(µk^{(1)}). Consider the logarithmic operation ln∘K, where for each kernel k ∈ K and any pair (x_i, x_j) we have (ln∘k)(x_i, x_j) = ln k(x_i, x_j). Therefore, ln∘K^{(2)} = {Σ_t µ_t k_t^{(1)} : k_t^{(1)} ∈ K^{(1)}}. This is a linear space of dimension at most m (with basis K^{(1)} when all the k^{(1)} are linearly independent). According to Theorem 11.4 of [1], the pseudo-dimension [26] of ln∘K^{(2)} is bounded by d_{ln∘K^{(2)}} ≤ m. We can recover K^{(2)} = exp∘ln∘K^{(2)}. Since the exponential function is monotonic, by applying Theorem 11.3 of [1] we arrive at

\[
d_{\mathcal{K}^{(2)}} = d_{\exp\circ\ln\circ\mathcal{K}^{(2)}} \le d_{\ln\circ\mathcal{K}^{(2)}} \le m. \qquad (12)
\]

Remark: Despite the simplicity of its proof, Theorem 2 implies that the outer exponential computation of our new kernel does not increase the complexity of the kernel domain in terms of pseudo-dimension. Compared with MKL, our MLMKL method, with its more flexible and richer optimization domain, would thus have a better chance of finding the best prediction function without explicitly increasing the generalization risk.

Recently, Ying and Campbell [30] employed the Rademacher chaos complexity Û_n(K) to measure the richness of K through its ability to fit noisy similarity values:

\[
\hat{U}_n(\mathcal{K}; X_1^n) = \mathbb{E}_{\varepsilon}\sup\Big\{\frac{1}{n}\sum_{i<j}\varepsilon_i\varepsilon_j k(x_i, x_j) \ :\ k\in\mathcal{K},\ x_i\in X_1^n\Big\},
\]

where ε_i is a Rademacher random variable taking values ±1 with uniform probability. We have:

Corollary 1. The Rademacher chaos complexity of K^{(2)} is bounded by

\[
\hat{U}_n(\mathcal{K}^{(2)}) \le (192e + 1)\kappa^2 m, \qquad (13)
\]

where κ = max_x k^{(2)}(x, x), n is the size of the training sample, and e is the natural constant.

The above corollary follows directly by combining Theorem 2 and the Rademacher chaos complexity result in [31, Theorem 3]. Finally, the generalization bound based on Û_n(K^{(2)}) can be obtained immediately by combining the above corollary with Lemma 9 of [30].

4 Experiments


Table 2: Evaluation of classification performance comparing a number of different algorithms. Each element in the table shows the mean and standard deviation of classification accuracy (%). The relative ranking of the different MKL algorithms on each data set is shown in parentheses. The last row shows the average rank score over all data sets achieved by each algorithm.

Data Set    | SVM       | MKL^Level     | LpMKL         | GMKL          | IKL           | MKM           | 2LMKL         | 2LMKL^Inf
Breast      | 96.8±1.0  | 96.5±0.8 (5)  | 96.2±0.7 (7)  | 97.0±1.0 (2)  | 96.5±0.7 (5)  | 97.1±1.0 (1)  | 97.0±1.0 (2)  | 96.9±0.7 (4)
Diabetes    | 76.7±1.8  | 75.8±2.5 (4)  | 72.6±2.5 (6)  | 66.4±2.5 (7)  | 76.0±3.0 (3)  | 75.8±2.5 (4)  | 76.6±1.6 (1)  | 76.6±1.9 (1)
Australian  | 84.6±1.4  | 85.0±1.5 (5)  | 84.5±1.6 (6)  | 80.0±2.3 (7)  | 85.4±1.2 (3)  | 85.3±0.9 (4)  | 85.5±1.6 (2)  | 85.7±1.6 (1)
Splice      | 85.0±1.4  | 88.4±2.4 (3)  | 87.1±1.6 (4)  | 92.4±1.4 (2)  | 80.0±1.5 (7)  | 84.6±1.5 (6)  | 92.9±1.1 (1)  | 84.7±1.2 (5)
FlareSolar  | 67.5±2.0  | 67.6±2.0 (2)  | 64.8±1.8 (5)  | 65.3±1.8 (3)  | 64.8±1.8 (5)  | 64.4±1.7 (7)  | 68.1±1.8 (1)  | 65.3±1.8 (3)
Titanic     | 78.0±3.2  | 77.1±2.9 (2)  | 77.0±3.0 (5)  | 76.7±3.1 (7)  | 76.8±2.8 (6)  | 77.1±3.0 (2)  | 77.1±3.0 (2)  | 77.8±2.6 (1)
Iono        | 92.8±2.0  | 91.7±1.9 (7)  | 92.6±1.4 (4)  | 92.7±1.8 (3)  | 93.7±1.0 (2)  | 91.7±2.7 (6)  | 92.3±1.5 (5)  | 94.4±0.9 (1)
Banana      | 89.7±1.5  | 90.2±2.0 (1)  | 87.5±2.6 (4)  | 83.4±2.7 (6)  | 90.2±1.8 (1)  | 80.5±5.3 (7)  | 86.8±2.1 (5)  | 90.2±1.6 (1)
Ringnorm    | 98.5±0.7  | 98.1±0.8 (3)  | 96.7±1.0 (7)  | 97.5±1.0 (6)  | 98.5±0.7 (1)  | 97.7±1.0 (5)  | 97.9±0.8 (4)  | 98.5±0.8 (1)
Waveform    | 89.0±1.8  | 88.2±1.6 (6)  | 88.9±2.0 (4)  | 88.2±1.8 (6)  | 89.7±2.3 (3)  | 90.0±1.6 (2)  | 88.7±1.9 (5)  | 90.4±1.6 (1)
Heart       | 82.1±3.0  | 83.0±2.9 (4)  | 76.7±3.8 (7)  | 77.0±3.6 (6)  | 83.3±2.1 (2)  | 82.4±2.5 (5)  | 83.1±2.5 (3)  | 83.6±2.1 (1)
Sonar       | 83.8±3.4  | 78.3±3.5 (7)  | 84.8±3.2 (1)  | 78.8±4.6 (6)  | 81.0±5.0 (4)  | 83.1±3.8 (3)  | 79.0±4.3 (5)  | 84.6±2.4 (2)
Thyroid     | 93.9±2.9  | 92.9±2.9 (6)  | 93.1±2.2 (5)  | 94.6±2.1 (3)  | 94.8±2.0 (1)  | 92.6±3.0 (7)  | 93.4±3.1 (4)  | 94.8±2.2 (1)
Liver       | 70.5±4.1  | 62.3±4.5 (6)  | 69.4±2.9 (2)  | 63.6±2.6 (4)  | 60.0±2.9 (7)  | 70.1±3.6 (1)  | 66.0±3.4 (3)  | 62.7±3.1 (5)
Adult       | 82.0±0.7  | 81.5±1.0 (4)  | 82.1±0.6 (1)  | 75.5±1.1 (6)  | 75.1±1.1 (7)  | 81.7±0.9 (3)  | 79.1±2.4 (5)  | 81.8±1.0 (2)
German      | 75.2±1.9  | 71.4±2.8 (5)  | 74.3±1.4 (3)  | 70.4±1.6 (6)  | 70.0±1.5 (7)  | 75.7±2.3 (1)  | 74.8±1.8 (2)  | 74.2±2.0 (4)
Rank        | N/A       | 4.38          | 4.44          | 5.00          | 4.00          | 4.00          | 3.13          | 2.13

4.1 Experimental Testbed and Setup

We evaluate the performance of the proposed Two-Layer MKL algorithms for binary classification tasks over a testbed of 16 publicly available data sets, as shown in Table 1.^{1,2}

Table 1: The statistics of the 16 binary-class data sets used in our experiments.

Data Set     | Breast | Ionosphere | Diabetes | Waveform | Sonar | Adult | Liver | German
# instances  | 683    | 351        | 768      | 400      | 208   | 1,605 | 345   | 1,000
# dimensions | 10     | 33         | 8        | 21       | 60    | 123   | 6     | 24

Data Set     | Splice | Australian | Thyroid  | Ringnorm | Heart | Banana | Titanic | FlareSolar
# instances  | 1,000  | 690        | 140      | 400      | 270   | 400    | 150     | 666
# dimensions | 60     | 14         | 5        | 20       | 13    | 2      | 3       | 9

Following the settings of previous MKL studies [29], for each data set we create the set of base kernels K as follows: (1) Gaussian kernels with 10 different widths ({2^{−3}, 2^{−2}, ..., 2^{6}}) on all features and on each single feature; (2) polynomial kernels of degree 1 to 3 on all features and on each single feature. Each base kernel matrix is normalized to unit trace. For each data set, we randomly sample 50% of all instances as training data and use the rest as test data. The training instances are normalized to zero mean and unit variance, and the test instances are normalized using the same mean and variance as the training data. To get stable results, for each data set we repeat each algorithm 20 times and report the average results of the 20 runs.

For comparison, we have tried our best to compare as many state-of-the-art MKL methods as possible, which were proposed under different contexts for various applications. The goal of our experiment is mainly to examine whether deep MKL is effective for improving the performance of shallow MKL techniques. Specifically, we compare the following algorithms:

SVM: The Support Vector Machine algorithm with a single Gaussian kernel. The bandwidth parameter is selected via 5-fold cross validation on the training data;

MKL^Level: The convex multiple kernel learning algorithm, whose target kernel class is K_conv defined in (3). We use the extended level method [29] to learn the kernel;

LpMKL: The MKL algorithm with L_p norm regularization over the kernel weights [16]. We adopt their cutting plane algorithm with a second-order Taylor approximation of L_p;

GMKL: The Generalized MKL algorithm in [28]. The target kernel class is the Hadamard product of single Gaussian kernels defined on each dimension;

IKL: The Infinite Kernel Learning algorithm proposed by [11]. We use MKL^Level as the embedded algorithm to solve for the kernel weights µ and α;

MKM: The Multilayer Kernel Machine with deep learning [6], which essentially trains an SVM with multilayer arc-cosine kernel functions;

2LMKL: The proposed Two-Layer MKL algorithm described in Algorithm 1;

2LMKL^Inf: The proposed Infinite Two-Layer MKL algorithm described in Algorithm 2.

For parameter settings, the regularization parameter C in the MKL and 2LMKL algorithms is determined by 5-fold cross validation on the training data over the range {10^{−2}, 10^{−1}, ..., 10^{2}}. For a fair comparison, the same set of base kernels is adopted by MKL^Level, LpMKL, and 2LMKL. For LpMKL, we examine p = 2, 3, 4 and report the best result. For fair comparison, in MKM we choose the number of layers l = 2 and find the best degree parameter ({0, 1, 2}) by cross validation. For 2LMKL^Inf, the initial base kernels are Gaussian kernels with 10 parameters and polynomial kernels of 3 degrees calculated on all the features. During the iterative process, we add one Gaussian kernel at each iteration.

^1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
^2 http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark
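As an illustration of the base-kernel construction described above, here is a short sketch (ours, simplified: kernels on all features only, omitting the per-feature variants) that builds the Gaussian and polynomial base kernels and normalizes each to unit trace.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_base_kernels(X):
    """Gaussian kernels with widths 2^-3, ..., 2^6 and polynomial kernels of
    degree 1-3 on all features, each normalized to unit trace."""
    kernels = []
    sq = cdist(X, X, "sqeuclidean")
    for gamma in 2.0 ** np.arange(-3, 7):      # 10 Gaussian widths
        kernels.append(np.exp(-gamma * sq))
    for degree in (1, 2, 3):                    # polynomial kernels
        kernels.append((X @ X.T) ** degree)
    return [K / np.trace(K) for K in kernels]   # unit-trace normalization
```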


4.2 Performance Analysis

Table 2 shows the detailed results of average classification accuracy and standard deviation values. To compare the overall performance, we count the ranks of the algorithms according to their performance on each data set; the average rank is included in the last row. From the results, we can draw several observations.

First of all, by comparing the results of SVM and the four existing MKL methods, we find that the existing MKL algorithms do not always outperform SVM with an RBF kernel. For example, the MKL^Level algorithm outperformed SVM only over five data sets (Australian, Splice, FlareSolar, Banana, and Heart), and was surpassed significantly by SVM over several data sets (German, Sonar, Liver, etc.). Although this seems a little surprising, a similar observation was reported in a previous empirical study [11], which also found that regular MKL does not always outperform SVM with an RBF kernel whose kernel parameter was tuned by cross validation. This observation validates our motivation for overcoming the shallow limitation of regular MKL methods.

Second, by comparing the four existing MKL methods, we observe that IKL overall achieved the best performance among them, and GMKL tended to perform slightly worse than the other methods. Specifically, among all the MKL methods, IKL won three best cases and LpMKL won two best cases, while MKL^Level and GMKL each won only one out of 16 cases. We conjecture that IKL performed better probably because IKL can draw on a largely increased kernel set, which may be more flexible for the classification task. Similarly, GMKL may have performed worse due to its relatively smaller kernel set, in which the base kernel set consists of only d kernels, each defined on a single dimension.

Third, by examining the results achieved by the two proposed algorithms, 2LMKL and 2LMKL^Inf, we find that both of them achieved rather impressive performance. Both considerably outperformed the other methods, including the existing MKL methods and the MKM method, over quite a number of data sets. Specifically, among all the MKL methods, 2LMKL won five best cases while 2LMKL^Inf won 11 best cases out of 16. This encouraging performance shows that our 2LMKL method is more effective than the regular MKL methods through the exploration of deep kernel learning, and is also more effective than the previous MKM method with deep learning.

Finally, comparing the two proposed two-layer MKL algorithms themselves, we observe that 2LMKL^Inf performed better than 2LMKL. This validates the efficacy of the proposed improvement that exploits the idea of infinite base kernel learning.

5 Conclusion

This paper presented a general framework of multi-layer multiple kernel learning (MLMKL) to overcome the shallow learning nature of regular MKL. Under this framework, we proposed a Two-Layer Multiple Kernel Learning (2LMKL) method and developed two effective algorithms to solve it. We analyzed the generalization risk of the proposed two-layer MKL algorithms and conducted an extensive set of experiments. Our empirical results showed that the proposed 2LMKL algorithms usually perform better than the existing shallow MKL methods, demonstrating the efficacy of the 2LMKL approach.

Despite the promising results, MLMKL remains a rather new area for future research. In our future work, we plan to extend the current two-layer MKL scheme to higher-layer MKL solutions to further enhance its efficacy. Akin to the training scheme of deep learning [13], one can learn µ^{(l)} in a bottom-up and layer-wise manner. In other words, we can embed the learned kernel k_s^{(l)} (the summation of base kernels at the current layer) into the base kernel functions to generate the candidate kernels k_1^{(l+1)}, ..., k_n^{(l+1)} for the next layer. Then conventional MKL algorithms are adopted to solve for the weights µ^{(l+1)}. This process can be repeated to construct deep kernels. We will also analyze the overfitting issue for MLMKL and investigate more theoretical insights into the power of multi-layer multiple kernel learning.

Acknowledgments

This research was in part supported by Singapore A* SERC Grant (102 158 0034), MOE Tier-1 Grant (RG15/08), Tier-1 Grant (RG67/07) and Tier-2 Grant (T208B2203).


References

[1] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[2] A. Argyriou, C. A. Micchelli, and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. In COLT, pages 338–352, 2005.
[3] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, pages 105–112, 2008.
[4] F. Bach, G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[5] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[6] Y. Cho and L. K. Saul. Kernel methods for deep learning. In NIPS, pages 342–350, 2009.
[7] Y. Cho and L. K. Saul. Large-margin classification in infinite neural networks. Neural Computation, 22(10):2678–2697, 2010.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, 2009.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[10] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, New York, NY, USA, 2000.
[11] P. V. Gehler and S. Nowozin. Infinite kernel learning. Technical Report TR-178, Max Planck Institute for Biological Cybernetics, 2008.
[12] M. Gönen and E. Alpaydin. Localized multiple kernel learning. In ICML, pages 352–359, 2008.
[13] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[14] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[15] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. The Annals of Statistics, 36(3):1171–1220, 2008.
[16] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate lp-norm multiple kernel learning. In NIPS, 2009.
[17] J. T. Kwok and I. W. Tsang. The pre-image problem in kernel methods. IEEE Transactions on Neural Networks, 15(6):1517–1525, 2004.
[18] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[19] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473–480, 2007.
[20] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In NIPS, pages 536–542, 1998.
[21] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, pages 775–782, 2007.
[22] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In COLT/EuroCOLT, pages 416–426, 2001.
[23] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
[24] S. Sonnenburg, G. Rätsch, and C. Schäfer. A general and efficient multiple kernel learning algorithm. In NIPS, 2005.
[25] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
[26] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, pages 169–183, 2006.
[27] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[28] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In ICML, page 134, 2009.
[29] Z. Xu, R. Jin, I. King, and M. R. Lyu. An extended level method for efficient multiple kernel learning. In NIPS, pages 1825–1832, 2008.
[30] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In COLT, 2009.
[31] Y. Ying and C. Campbell. Rademacher chaos complexity for learning the kernel problem. Technical report, http://secamlocal.ex.ac.uk/people/staff/yy267/KLbound-version3.pdf, 2010.
[32] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. In NIPS, pages 1081–1088, 2001.
