Two-Layer Multiple Kernel Learning

Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
School of Computer Engineering, Nanyang Technological University, Singapore
[email protected]  [email protected]  [email protected]

Abstract

Multiple Kernel Learning (MKL) aims to learn kernel machines for solving a real machine learning problem (e.g., classification) by exploring the combinations of multiple kernels. The traditional MKL approach is in general "shallow" in the sense that the target kernel is simply a linear (or convex) combination of some base kernels. In this paper, we investigate a framework of Multi-Layer Multiple Kernel Learning (MLMKL) that aims to learn "deep" kernel machines by exploring the combinations of multiple kernels in a multi-layer structure, which goes beyond the conventional MKL approach. Through a multiple-layer mapping, the proposed MLMKL framework offers higher flexibility than regular MKL for finding the optimal kernel for applications. As the first attempt at this new MKL framework, we present a Two-Layer Multiple Kernel Learning (2LMKL) method together with two efficient algorithms for classification tasks. We analyze their generalization performance and conduct an extensive set of experiments over 16 benchmark datasets, in which encouraging results show that our method performs better than conventional MKL methods.

1 Introduction

Kernel learning is one of the active research topics in the machine learning community. The family of kernel-based algorithms has been extensively studied over the past decade [23]. Some well-known examples include Support Vector Machines (SVM) [9, 27], Kernel Logistic Regression [32], and Kernel PCA for denoising [20, 17]. These kernel methods have been successfully applied to a variety of real applications and often achieve promising performance.

The most crucial element of a kernel method is the kernel, which is in general a function that defines an inner product between any two examples in some induced Hilbert space [10, 23, 15]. By mapping data from an input space to the reproducing kernel Hilbert space (RKHS) [10], which could be potentially high-dimensional, traditional linear methods can be extended with reasonable effort to yield considerably better performance. Many empirical studies have shown that the choice of kernel often affects the resulting performance of kernel methods significantly. In fact, inappropriate kernels can result in sub-optimal or very poor performance.

For many real-world situations, it is often not an easy task to choose an appropriate kernel function, as doing so usually requires domain knowledge that non-expert users may lack. To address this limitation, recent years have witnessed active research on learning effective kernels automatically from data [4]. One popular technique for kernel learning is Multiple Kernel Learning (MKL) [4, 24], which aims at learning a linear (or convex) combination of a set of predefined kernels in order to identify a good target kernel for the application. Compared with traditional kernel methods using a single fixed kernel, MKL exhibits the strengths of automated kernel parameter tuning and of combining heterogeneous data. Over the past few years, MKL has been actively investigated: a number of algorithms have been proposed to improve the efficiency of MKL [4, 25, 21, 29], and a number of extended MKL techniques have been proposed to improve the regular linear MKL method [11, 12, 3, 28, 16, 8].

Despite being studied actively, existing MKL methods unfortunately do not always produce considerably better empirical performance than a single kernel whose parameters were tuned by cross validation [11]. There are several possible reasons for such failure. One conjecture is that the target kernel domain K of MKL using a linear combination may not be rich enough to contain the optimal kernel. Therefore, some emerging studies have attempted to improve regular MKL by exploring a more general kernel domain K using some nonlinear combination [28]. Following a similar motivation, we speculate that the insignificant performance gain is probably due to the shallow learning nature of regular MKL, which simply adopts a flat (linear/nonlinear) combination of multiple kernels.

To this end, this paper presents a novel framework of Multi-Layer Multiple Kernel Learning (MLMKL), which applies the idea of deep learning to improve the MKL task. Deep architectures have been actively studied in the machine learning community and have shown promising performance for some applications [13, 14, 19, 5]. Our study was partially inspired by the recent work [6] that first explored kernel methods with the idea of deep learning. However, unlike the previous work, our study in this paper mainly aims to address the challenge of improving the existing MKL techniques with deep learning. Specifically, we introduce a multilayer architecture for MLMKL, in which all base kernels in antecedent layers are combined to form inputs to kernels in subsequent layers. We also provide an efficient alternating optimization algorithm to learn the decision function and the weights of base kernels simultaneously. To further minimize the domain knowledge required to design the choices and the numbers of base kernels in each antecedent layer, we also present an infinite base kernel learning algorithm for our proposed MLMKL framework.

The rest of this paper is organized as follows. Section 2 gives some preliminaries of multiple kernel learning and deep learning. Section 3 first presents the framework of MLMKL and then proposes a Two-Layer MKL method, followed by the development of two efficient algorithms and the analysis of their generalization performance. Section 4 discusses an extensive set of experiments for performance evaluation over a testbed with 16 publicly available benchmark data sets. Section 5 concludes this paper.

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

2 Preliminaries

In this section, we introduce some preliminaries of MKL and some emerging studies of deep learning for kernel methods.

2.1 Multiple Kernel Learning

Consider a collection of n training samples X_1^n = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^d is the input feature vector and y_i is the class label of x_i. In general, the problem of conventional multiple kernel learning (MKL) can be formulated as the following optimization scheme [4]:

\[
\min_{k\in\mathcal{K}}\ \min_{f\in\mathcal{H}_k}\ \lambda\|f\|_{\mathcal{H}_k} + \sum_{i=1}^{n}\ell\big(y_i f(x_i)\big), \qquad (1)
\]

where ℓ(·) denotes some loss function, e.g. the hinge loss ℓ(t) = max(0, 1 − t) used for SVM, H_k is the reproducing kernel Hilbert space associated with kernel k, K denotes the optimization domain of the candidate kernels, and λ is a regularization parameter. The above optimization aims to simultaneously identify both the optimal kernel k from the domain K and the optimal prediction function f from the reproducing kernel Hilbert space H_k induced by the optimal kernel k. If the optimal kernel k is given a priori, the above formulation essentially reduces to the kernel SVM.

By the representer theorem [22], the decision function f(x) for the above formulation is in the form of a linear expansion of kernel evaluations on the training samples x_i:

\[
f(x) = \sum_{i=1}^{n}\beta_i k(x_i, x), \qquad (2)
\]

where the β_i are the coefficients.

In traditional MKL [18], K is chosen to be the set of convex combinations of predefined base kernels:
\[
\mathcal{K}_{\mathrm{conv}} = \Big\{ k(\cdot,\cdot) = \sum_{t=1}^{m}\mu_t k_t(\cdot,\cdot) \ :\ \sum_{t=1}^{m}\mu_t = 1,\ \mu_t \ge 0,\ t = 1,\ldots,m \Big\}, \qquad (3)
\]

where each candidate kernel k is some combination of the m base kernels {k_1, ..., k_m}, and µ_t is the coefficient of the t-th base kernel. From (3), we can expand the decision function in (2) with the multiple kernels:

∑n ∑m ∑n ∑m f(x) = βi µtkt(xi, x) = βiµtkt(xi, x), i=1 t=1 i=1 t=1 (4)

Although it has been studied extensively, the traditional MKL approach, similar to SVM [27], falls short in that the resulting kernel machine is "shallow": it often adopts a simple conic combination of multiple kernels and trains the classifier with the combined kernel in a "flat" architecture, which may not be powerful enough to fit the diverse patterns of some complicated real-world tasks.

2.2 Deep Learning and Multilayer Kernels

Recently, many machine learning studies have addressed one limitation of conventional learning techniques (such as SVM) regarding their shallow learning architectures. It has been shown that deep architectures, such as multilayer neural nets, are often preferable over shallow ones. Very recently, Cho and Saul [6, 7] first introduced the idea of deep learning to kernel methods, which can be applied either in deep architectures or in shallow structures, like SVM. An l-layer kernel is the inner product after multiple feature mappings of the inputs:

\[
k^{(l)}(x_i, x_j) = \Big\langle \underbrace{\Phi(\Phi(\ldots(\Phi}_{l \text{ times}}(x_i)))),\ \underbrace{\Phi(\Phi(\ldots(\Phi}_{l \text{ times}}(x_j)))) \Big\rangle,
\]

where Φ is the underlying feature mapping function of k and ⟨·, ·⟩ computes the inner product.

Specifically, we consider an example of a two-layer RBF kernel. An RBF kernel is typically defined as

\[
k(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2},
\]

where γ > 0 is the kernel parameter. Applying the idea of a two-layer kernel to the RBF kernel, the composition yields

\[
\big\langle \Phi(\Phi(x_i)), \Phi(\Phi(x_j)) \big\rangle
= e^{-\gamma\|\Phi(x_i)-\Phi(x_j)\|^2}
= e^{-2\gamma(1-k(x_i,x_j))}
= \kappa\, e^{2\gamma k(x_i,x_j)}, \qquad (5)
\]

where κ is a constant that can be omitted. The same idea can be applied to other types of kernels. In [6, 7], the authors provided a multiple-layer composition approach with respect to a special family of arc-cosine kernel functions.
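As a quick illustration of (5), the sketch below (ours, not part of the original paper) computes a two-layer RBF kernel matrix directly from the base RBF kernel, using the identity ‖Φ(x_i) − Φ(x_j)‖² = 2(1 − k(x_i, x_j)) for the unit-norm RBF feature map.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, Z, gamma):
    """Base RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * cdist(X, Z, "sqeuclidean"))

def two_layer_rbf_kernel(X, Z, gamma):
    """Two-layer RBF kernel <Phi(Phi(x)), Phi(Phi(z))> from Eqn. (5).

    Since ||Phi(x)||^2 = k(x, x) = 1, the squared distance in feature space is
    2 * (1 - k(x, z)), so the composed kernel equals exp(-2*gamma*(1 - k(x, z))),
    i.e. kappa * exp(2*gamma*k(x, z)) with kappa = exp(-2*gamma).
    """
    K1 = rbf_kernel(X, Z, gamma)
    return np.exp(-2.0 * gamma * (1.0 - K1))
```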

Remark. The work studied in [6] has some limitations. First, the proposed multi-layer kernel was applied to only a single type of kernel, typically some special kernel function, such as the arc-cosine kernel [6]. In a real application, a more desirable solution is to allow a combination of a variety of different kernels when designing the deep kernel. Second, the multi-layer kernel proposed in [6] is often "static", i.e., some fixed kernel (where the degree n and level l parameters were chosen manually). No solution has been provided to optimize the kernel by learning the optimal parameters automatically. Our work was partially motivated by this work to address the above limitations.

3 Multi-Layer Multiple Kernel Learning

In this section, we first introduce a general framework of Multi-Layer Multiple Kernel Learning (MLMKL), and then present an MLMKL paradigm, i.e., Two-Layer Multiple Kernel Learning.

3.1 Framework

Following the optimization framework of MKL, the basic idea of MLMKL is to relax the optimization domain K in the traditional MKL optimization by adopting a family of deep kernels. Specifically, we first define a domain of l-level multi-layer kernels as follows:

\[
\mathcal{K}^{(l)} = \Big\{ k^{(l)}(\cdot,\cdot) = g^{(l)}\big( [k_1^{(l-1)}(\cdot,\cdot), \ldots, k_m^{(l-1)}(\cdot,\cdot)] \big) \Big\},
\]

where g^{(l)} is some function that combines multiple (l−1)-level kernels and must ensure the resulting combination is a valid kernel. With this domain, in a way similar to regular MKL, we can formulate the optimization problem of l-level MLMKL as:

\[
\min_{k\in\mathcal{K}^{(l)}}\ \min_{f\in\mathcal{H}_k}\ \lambda\|f\|_{\mathcal{H}_k} + \sum_{i=1}^{n}\ell\big(y_i f(x_i)\big).
\]

To explain it intuitively, Figure 1 illustrates the architecture of an example three-layer MKL paradigm.

Figure 1: The architecture of the proposed deep multiple kernel learning framework. Shown is an example of three-layer MKL; some connections are not displayed to simplify the figure.

Despite sharing a similar optimization form, MLMKL is much more challenging than conventional shallow MKL. This is because there are many unknown structures and variables, including the initialization of base kernels, the unknown combination functions g^{(l)} at each level, and the final prediction model f. Apparently, it is not possible to fully optimize every aspect. In the following, we attempt to attack this challenge by considering a simplified paradigm, i.e., Two-Layer Multiple Kernel Learning (2LMKL).
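The following sketch (our own illustration, not from the paper) shows one concrete way to realize the nested domain K^{(l)}: kernel matrices at each level are combined by an exponentiated weighted sum, the same choice of g used for the two-layer case below. The helper name and the single-kernel-per-layer simplification are assumptions.

```python
import numpy as np

def compose_multilayer_kernel(level_one_kernels, layer_weights):
    """Build an l-level kernel matrix K^(l) = g(K_1^(l-1), ..., K_m^(l-1)).

    level_one_kernels: list of (n, n) base kernel matrices (level 1).
    layer_weights:     list of weight vectors, one per additional layer; each
                       layer combines the previous layer's kernels via
                       g(K_1, ..., K_m) = exp(sum_t mu_t * K_t).
    For simplicity this sketch produces a single combined kernel per layer.
    """
    kernels = level_one_kernels
    for mu in layer_weights:
        combined = sum(w * K for w, K in zip(mu, kernels))
        kernels = [np.exp(combined)]  # feeds the next layer
    return kernels[0]
```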


3.2 Two-Layer Multiple Kernel Learning

To simplify the notation, we restrict our discussion to a Two-Layer Multiple Kernel Learning task in this section; our algorithm can also be extended to general multi-layer MKL. Further, we employ an RBF kernel for the combination function g^{(2)} and define the two-layer multiple kernel domain as follows:

\[
\mathcal{K}^{(2)} = \Big\{ k^{(2)}(x_i, x_j; \mu) = \exp\Big(\sum_{t=1}^{m}\mu_t k_t^{(1)}(x_i, x_j)\Big) \ :\ \mu \in \mathbb{R}_{+}^{m} \Big\}, \qquad (6)
\]

where µ_t denotes the weight of the t-th antecedent-layer kernel. Thus we can formulate the two-layer MKL with the kernel domain K^{(2)} as:

\[
\min_{k\in\mathcal{K}^{(2)}}\ \min_{f\in\mathcal{H}_k}\ \frac{1}{2}\|f\|_{\mathcal{H}_k}^{2} + C\sum_{i=1}^{n}\max\big(0,\ 1 - y_i f(x_i)\big) + \sum_{t=1}^{m}\mu_t.
\]

Note that the last term is introduced as a regularization to prevent the coefficients from becoming too large. We can further turn the above formulation into the following equivalent min-max optimization:

\[
\min_{\mu}\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j k^{(2)}(x_i, x_j; \mu) + \sum_{t=1}^{m}\mu_t
\quad \text{s.t.}\ 0\le\alpha_i\le C,\ \sum_{i=1}^{n}\alpha_i y_i = 0,\ \mu_t\ge 0,\ t=1,\ldots,m, \qquad (7)
\]

where α = [α_1, ..., α_n]^⊤ is the vector of dual variables and µ = [µ_1, ..., µ_m]^⊤. Once the above optimization is solved for α and µ, it is straightforward to obtain the final decision function of the Two-Layer MKL machine:

\[
f(x; \alpha, \mu) = \sum_{i=1}^{n}\alpha_i y_i k^{(2)}(x_i, x; \mu) + b, \qquad (8)
\]

where the bias term b can be easily determined from the KKT conditions. Due to the nonlinearity of the exp(·) function, we expect that the decision function in (8) can represent richer prediction tasks than that in (4).

The next challenge is how to solve the above optimization. We consider an alternating optimization scheme: (1) fix α and solve µ; and (2) fix µ and solve α. Specifically, let us denote by J(α, µ) the following function:

\[
J(\alpha, \mu) = \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j k^{(2)}(x_i, x_j; \mu) - \sum_{i=1}^{n}\alpha_i - \sum_{t=1}^{m}\mu_t.
\]

Since k^{(2)} is positive semi-definite, the objective is convex over α, so for a fixed µ it can be solved by standard QP solvers. However, for any pair (i, j) with y_i y_j = −1 we have α_iα_j y_i y_j k^{(2)}(x_i, x_j; µ) ≤ 0; thus J is non-convex over µ. We simply compute the d-th component of the gradient w.r.t. µ:

\[
\big[\nabla J_{\mu}\big]_d := \big[\nabla_{\mu} J(\alpha^{t}, \mu^{t-1})\big]_d
= \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, k^{(2)}(x_i, x_j; \mu^{t-1})\, k_d^{(1)}(x_i, x_j) - 1. \qquad (9)
\]

Then we update µ by gradient ascent:

\[
\mu^{t} = \max\big(\mu^{t-1} + \eta\,\nabla J_{\mu^{t-1}},\ 0\big).
\]

The step size η can be set by Armijo's rule such that convergence is guaranteed. Let Θ = {θ_1, ..., θ_m} denote the set of hyper-parameters corresponding to the base kernels k_t^{(1)} inside k^{(2)} in (6). Algorithm 1 shows the detailed optimization steps of the proposed two-layer MKL algorithm with the given Θ.
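For concreteness, here is a small NumPy sketch (ours, not the authors') of the µ-update used in the alternating scheme: it builds k^{(2)} from the base kernel matrices as in (6), evaluates the gradient components of (9), and applies the projected gradient-ascent step. Names and array shapes are assumptions.

```python
import numpy as np

def two_layer_kernel(base_K, mu):
    """k^(2)(x_i, x_j; mu) = exp(sum_t mu_t * k_t^(1)(x_i, x_j)).
    base_K: array of shape (m, n, n) stacking the base kernel matrices."""
    return np.exp(np.tensordot(mu, base_K, axes=1))

def grad_mu(alpha, y, base_K, mu):
    """Gradient components from (9):
    0.5 * sum_ij alpha_i alpha_j y_i y_j k^(2)(x_i, x_j; mu) k_d^(1)(x_i, x_j) - 1."""
    K2 = two_layer_kernel(base_K, mu)
    A = np.outer(alpha * y, alpha * y) * K2
    return 0.5 * np.einsum("ij,dij->d", A, base_K) - 1.0

def update_mu(alpha, y, base_K, mu, eta):
    """Projected gradient-ascent step: mu <- max(mu + eta * grad, 0)."""
    return np.maximum(mu + eta * grad_mu(alpha, y, base_K, mu), 0.0)
```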


Algorithm 1 Two-Layer MKL (2LMKL): (α, µ) = TwoLayerMKLwithTheta(Θ_0; X_1^n)
Input: Training sample X_1^n, initial set of base kernel parameters Θ_0 = {θ_1, ..., θ_m};
Output: weight vector µ of base kernels, dual variables α of SVM.
1: Randomly initialize µ^0, compute initial base kernels with Θ;
2: repeat
3:   Compute the current kernel matrix with µ^{t−1};
4:   α^t = arg min_α J(α, µ^{t−1}) by an SVM solver;
5:   Compute ∇_µ J(α^t, µ^{t−1}) by (9) as the ascent direction;
6:   Determine the step size η^t by Armijo's rule, update µ^t = max(µ^{t−1} + η^t ∇J_{µ^{t−1}}, 0);
7: until convergence

3.3 Improved Two-Layer MKL Algorithm with Infinite Base Kernel Learning

Notice that, similar to traditional MKL algorithms, our proposed MLMKL algorithm must also assume that a set of predefined base kernels inside k^{(2)}, as in K^{(2)}, is provided beforehand. If the number of base kernels is too small, k^{(2)} may not be flexible enough to fit the complicated patterns of a real problem. On the other hand, both the time cost and the space cost of MKL/MLMKL increase with the cardinality of K_conv/K^{(2)}, which may be computationally inefficient when too many base kernels are used. Moreover, though the proposed deep MKL architecture provides flexibility for the design of multi-layer kernels, determining an appropriate set of base kernels usually requires some domain knowledge, which may be difficult for non-expert users.

To partially address these challenges, here we propose to generate the base kernels inside k^{(2)} iteratively. This can be done by selecting a base kernel that optimizes the objective function in (7), which is similar to the idea of infinite kernel learning [2, 11] except that our base kernels are in the antecedent layer. Assume the inner base kernel is continuously parameterized by θ, for example, the bandwidth parameter of a Gaussian kernel or the degree of a polynomial kernel. To expand the base kernel set K, we choose a θ such that the resulting single kernel maximizes J with the current solution α:

\[
\max_{\theta\in\mathbb{R}_{+}}\ J(\alpha, \theta) = \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j \exp\big(k(x_i, x_j; \theta)\big). \qquad (10)
\]

Again, this problem is non-convex over θ. Similar to solving for µ, we compute the gradient of (10) w.r.t. θ and perform gradient ascent. For example, if the inner base kernel is a Gaussian kernel k(x_i, x_j; θ) = exp(−θ‖x_i − x_j‖²), the gradient can be computed as follows:

\[
\nabla J_{\theta} = \frac{1}{2}\sum_{i,j\in\mathbb{N}_n}\alpha_i\alpha_j y_i y_j\, e^{k(x_i,x_j;\theta)}\, k(x_i, x_j; \theta)\,\|x_i - x_j\|^{2}. \qquad (11)
\]

After that, we employ a line search approach to determine the step size for gradient ascent. The proposed improved two-layer MKL algorithm iterates between the following two steps: (1) iteratively solve the dual variables α and the kernel weights µ as in the previous algorithm; (2) add a new θ to Θ by the base kernel generation method. Algorithm 2 summarizes the details of the improved two-layer multiple kernel learning algorithm, which is denoted as 2LMKL^Inf for short.
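Below is a small sketch (our illustration, with hypothetical names) of the base-kernel generation step: it evaluates the objective in (10) for a Gaussian inner kernel and climbs it with gradient-ascent steps, using a central finite difference as a simple stand-in for the closed form in (11).

```python
import numpy as np
from scipy.spatial.distance import cdist

def generation_objective(alpha, y, sq_dists, theta):
    """J(alpha, theta) from (10) with a Gaussian inner kernel
    k(x_i, x_j; theta) = exp(-theta * ||x_i - x_j||^2)."""
    K = np.exp(-theta * sq_dists)
    return 0.5 * np.sum(np.outer(alpha * y, alpha * y) * np.exp(K))

def generate_new_theta(alpha, y, X, theta0=1.0, eta=0.1, max_iter=50, tol=1e-6):
    """Gradient-ascent search for a new base-kernel parameter theta,
    using a finite-difference gradient in place of Eqn. (11)."""
    sq_dists = cdist(X, X, "sqeuclidean")
    theta, prev = theta0, -np.inf
    for _ in range(max_iter):
        cur = generation_objective(alpha, y, sq_dists, theta)
        if cur - prev < tol:          # objective no longer improving
            break
        prev = cur
        eps = 1e-6
        grad = (generation_objective(alpha, y, sq_dists, theta + eps)
                - generation_objective(alpha, y, sq_dists, theta - eps)) / (2 * eps)
        theta = max(theta + eta * grad, 1e-12)  # keep theta positive
    return theta
```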

Algorithm 2 Infinite Two-Layer MKL (2LMKL^Inf): (α, µ, Θ) = TwoLayerMKL(Θ_0; X_1^n)
Input: Initial set of base kernel parameters Θ_0, training sample X_1^n;
Output: Final set of base kernel parameters Θ, weight vector µ of base kernels, dual variables α.
1: Initialize Θ = Θ_0;
2: while true do
3:   (α, µ) = TwoLayerMKLwithTheta(Θ; X_1^n);
4:   θ = NewKernel(α; X_1^n);
5:   if J(α, θ) ≤ J(α, µ) then
6:     break;
7:   end if
8:   Θ = Θ ∪ {θ};
9: end while
10:
11: Function θ = NewKernel(α; X_1^n):
12: Randomly initialize θ^0;
13: while J(α, θ^{t−1}) is improving do
14:   Compute the gradient ∇J_θ by an approach similar to Eqn. (11);
15:   Determine a step size η^t, update θ^t = θ^{t−1} + η^t ∇J_θ;
16: end while
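A minimal code-form sketch of Algorithm 2's outer loop (ours; the helper callables such as two_layer_mkl_with_theta, new_kernel, and the objective J are assumed to exist, e.g. as in the sketches above):

```python
def infinite_two_layer_mkl(theta0_set, X, y, two_layer_mkl_with_theta, new_kernel, J):
    """Alternate between fitting (alpha, mu) on the current base-kernel set and
    generating a new candidate parameter theta; stop once the candidate no longer
    improves the objective, mirroring steps 1-9 of Algorithm 2."""
    Theta = list(theta0_set)
    while True:
        alpha, mu = two_layer_mkl_with_theta(Theta, X, y)   # Algorithm 1
        theta = new_kernel(alpha, X, y)                     # maximize (10) over theta
        if J(alpha, theta) <= J(alpha, mu):                 # no improvement: stop
            break
        Theta.append(theta)
    return alpha, mu, Theta
```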


3.4 Analysis of Generalization Performance

We are aware of the trend of seeking new kernel combination methods beyond traditional MKL. However, the kernel encodes prior knowledge about the data, and its construction cannot be fully "automated". When we add more flexibility to kernel learning, we also potentially increase the difficulty of finding the optimal kernel. This calls for theoretical analysis of these generalized kernel combination methods. We base our analysis of two-layer MKL mainly on the notion of the pseudo-dimension of the kernel optimization domain K^{(2)} [26].

Theorem 1. [26] Let L(f) = P(f(x_i)y_i ≤ 0) be the generalization risk of some prediction function f learned by solving (1), and L_n(f) = (1/n) Σ_{i=1}^{n} 1(f(x_i)y_i < γ) be the empirical error. For a kernel family K with pseudo-dimension d_K, the generalization risk of f is bounded as

\[
L(f) \le L_n(f) + \tilde{O}\Big(\sqrt{\big(d_{\mathcal{K}} + 1/\gamma^2\big)/n}\Big),
\]

where γ is the margin in the loss function ℓ(t) = max(0, γ − t), and the Õ notation hides logarithmic factors in its argument, the sample size, and the allowed failure probability.

Here the pseudo-dimension d_K measures the richness or complexity of a kernel domain K:

Definition 1. Let K be a set of kernel functions mapping from X × X to R. We say that a set of paired examples S_n = {(x_i, x_i') ∈ X × X, i = 1, ..., n} is pseudo-shattered by K if there are real numbers r_1, ..., r_n such that for any b ∈ {−1, 1}^n there is a function k ∈ K with the property sgn(k(x_i, x_i') − r_i) = b_i for any (x_i, x_i') ∈ S_n. Then, we define the pseudo-dimension d_K of K to be the largest n such that there exists an S_n that can be pseudo-shattered by K.

Theorem 2. Let K^{(1)} = {k_1^{(1)}, ..., k_m^{(1)}} be the inner base kernel family for the two-layer deep kernel domain K^{(2)} defined in (6), where m is the number of base kernels. Assuming the evaluation of each k^{(1)} is always positive, the pseudo-dimension d_{K^{(2)}} is bounded by d_{K^{(2)}} ≤ m.

Proof. First, we re-write K^{(2)} as follows:

\[
\mathcal{K}^{(2)} = \Big\{ \prod_t \exp\big(\mu_t k_t^{(1)}\big) \ :\ k_t^{(1)} \in \mathcal{K}^{(1)} \Big\}.
\]

Thus each k^{(2)} ∈ K^{(2)} is a product of base kernels of the form exp(µk^{(1)}). Consider the logarithmic operation ln∘K, where for each kernel k ∈ K and any pair (x_i, x_j) we have (ln∘k)(x_i, x_j) = ln k(x_i, x_j). Therefore, ln∘K^{(2)} = {Σ_t µ_t k_t^{(1)} : k_t^{(1)} ∈ K^{(1)}}. This is a linear space of dimension at most m (with basis K^{(1)} when all the k^{(1)} are linearly independent). According to Theorem 11.4 of [1], the pseudo-dimension [26] of ln∘K^{(2)} is bounded by d_{ln∘K^{(2)}} ≤ m. We can recover K^{(2)} = exp∘ln∘K^{(2)}. Since the exponential function is monotonic, by applying Theorem 11.3 of [1] we arrive at

\[
d_{\mathcal{K}^{(2)}} = d_{\exp\circ\ln\circ\mathcal{K}^{(2)}} \le d_{\ln\circ\mathcal{K}^{(2)}} \le m. \qquad (12)
\]

Remark: Despite the simplicity of its proof, Theorem 2 implies that the outer exponential computation of our new kernel does not increase the complexity of the kernel domain in terms of pseudo-dimension. Compared with MKL, our MLMKL method, with its more flexible and richer optimization domain, would thus have a better chance of finding the best prediction function without explicitly increasing the generalization risk.

Recently, Ying and Campbell [30] employed the Rademacher chaos complexity Û_n(K) to measure the richness of K through its ability to fit noisy similarity values:

\[
\hat{U}_n(\mathcal{K}; X_1^n) = \mathbb{E}_{\varepsilon}\sup\Big\{\frac{1}{n}\sum_{i<j}\varepsilon_i\varepsilon_j k(x_i, x_j) \ :\ k\in\mathcal{K},\ x_i\in X_1^n\Big\},
\]

where ε_i is a Rademacher random variable taking values ±1 with uniform probability. We have:

Corollary 1. The Rademacher chaos complexity of K^{(2)} is bounded by

\[
\hat{U}_n(\mathcal{K}^{(2)}) \le (192e + 1)\kappa^2 m, \qquad (13)
\]

where κ = max_x k^{(2)}(x, x), n is the size of the training sample, and e is the natural constant.

The above corollary follows directly by combining Theorem 2 and the Rademacher chaos complexity result in [31, Theorem 3]. Finally, the generalization bound based on Û_n(K^{(2)}) can be obtained immediately by combining the above corollary with Lemma 9 of [30].

4 Experiments


Table 2: Evaluation of classification performance comparing a number of different algorithms. Each element in the table shows the mean and standard deviation of classification accuracy (%). The relative ranking of the different MKL algorithms on each data set is shown in parentheses. The last row shows the average rank score over all data sets achieved by each algorithm.

Data Set    | SVM       | MKL^Level     | LpMKL         | GMKL          | IKL           | MKM           | 2LMKL         | 2LMKL^Inf
Breast      | 96.8±1.0  | 96.5±0.8 (5)  | 96.2±0.7 (7)  | 97.0±1.0 (2)  | 96.5±0.7 (5)  | 97.1±1.0 (1)  | 97.0±1.0 (2)  | 96.9±0.7 (4)
Diabetes    | 76.7±1.8  | 75.8±2.5 (4)  | 72.6±2.5 (6)  | 66.4±2.5 (7)  | 76.0±3.0 (3)  | 75.8±2.5 (4)  | 76.6±1.6 (1)  | 76.6±1.9 (1)
Australian  | 84.6±1.4  | 85.0±1.5 (5)  | 84.5±1.6 (6)  | 80.0±2.3 (7)  | 85.4±1.2 (3)  | 85.3±0.9 (4)  | 85.5±1.6 (2)  | 85.7±1.6 (1)
Splice      | 85.0±1.4  | 88.4±2.4 (3)  | 87.1±1.6 (4)  | 92.4±1.4 (2)  | 80.0±1.5 (7)  | 84.6±1.5 (6)  | 92.9±1.1 (1)  | 84.7±1.2 (5)
FlareSolar  | 67.5±2.0  | 67.6±2.0 (2)  | 64.8±1.8 (5)  | 65.3±1.8 (3)  | 64.8±1.8 (5)  | 64.4±1.7 (7)  | 68.1±1.8 (1)  | 65.3±1.8 (3)
Titanic     | 78.0±3.2  | 77.1±2.9 (2)  | 77.0±3.0 (5)  | 76.7±3.1 (7)  | 76.8±2.8 (6)  | 77.1±3.0 (2)  | 77.1±3.0 (2)  | 77.8±2.6 (1)
Iono        | 92.8±2.0  | 91.7±1.9 (7)  | 92.6±1.4 (4)  | 92.7±1.8 (3)  | 93.7±1.0 (2)  | 91.7±2.7 (6)  | 92.3±1.5 (5)  | 94.4±0.9 (1)
Banana      | 89.7±1.5  | 90.2±2.0 (1)  | 87.5±2.6 (4)  | 83.4±2.7 (6)  | 90.2±1.8 (1)  | 80.5±5.3 (7)  | 86.8±2.1 (5)  | 90.2±1.6 (1)
Ringnorm    | 98.5±0.7  | 98.1±0.8 (3)  | 96.7±1.0 (7)  | 97.5±1.0 (6)  | 98.5±0.7 (1)  | 97.7±1.0 (5)  | 97.9±0.8 (4)  | 98.5±0.8 (1)
Waveform    | 89.0±1.8  | 88.2±1.6 (6)  | 88.9±2.0 (4)  | 88.2±1.8 (6)  | 89.7±2.3 (3)  | 90.0±1.6 (2)  | 88.7±1.9 (5)  | 90.4±1.6 (1)
Heart       | 82.1±3.0  | 83.0±2.9 (4)  | 76.7±3.8 (7)  | 77.0±3.6 (6)  | 83.3±2.1 (2)  | 82.4±2.5 (5)  | 83.1±2.5 (3)  | 83.6±2.1 (1)
Sonar       | 83.8±3.4  | 78.3±3.5 (7)  | 84.8±3.2 (1)  | 78.8±4.6 (6)  | 81.0±5.0 (4)  | 83.1±3.8 (3)  | 79.0±4.3 (5)  | 84.6±2.4 (2)
Thyroid     | 93.9±2.9  | 92.9±2.9 (6)  | 93.1±2.2 (5)  | 94.6±2.1 (3)  | 94.8±2.0 (1)  | 92.6±3.0 (7)  | 93.4±3.1 (4)  | 94.8±2.2 (1)
Liver       | 70.5±4.1  | 62.3±4.5 (6)  | 69.4±2.9 (2)  | 63.6±2.6 (4)  | 60.0±2.9 (7)  | 70.1±3.6 (1)  | 66.0±3.4 (3)  | 62.7±3.1 (5)
Adult       | 82.0±0.7  | 81.5±1.0 (4)  | 82.1±0.6 (1)  | 75.5±1.1 (6)  | 75.1±1.1 (7)  | 81.7±0.9 (3)  | 79.1±2.4 (5)  | 81.8±1.0 (2)
German      | 75.2±1.9  | 71.4±2.8 (5)  | 74.3±1.4 (3)  | 70.4±1.6 (6)  | 70.0±1.5 (7)  | 75.7±2.3 (1)  | 74.8±1.8 (2)  | 74.2±2.0 (4)
Rank        | N/A       | 4.38          | 4.44          | 5.00          | 4.00          | 4.00          | 3.13          | 2.13

4.1 Experimental Testbed and Setup

We evaluate the performance of the proposed Two-Layer MKL algorithms for binary classification tasks over a testbed of 16 publicly available data sets, as shown in Table 1.^{1,2}

Table 1: The statistics of the 16 binary-class data sets used in our experiments.

Data Set     | Breast | Ionosphere | Diabetes | Waveform | Sonar | Adult | Liver | German
# instances  | 683    | 351        | 768      | 400      | 208   | 1,605 | 345   | 1,000
# dimensions | 10     | 33         | 8        | 21       | 60    | 123   | 6     | 24

Data Set     | Splice | Australian | Thyroid  | Ringnorm | Heart | Banana | Titanic | FlareSolar
# instances  | 1,000  | 690        | 140      | 400      | 270   | 400    | 150     | 666
# dimensions | 60     | 14         | 5        | 20       | 13    | 2      | 3       | 9

Following the settings of previous MKL studies [29], for each data set we create the set of base kernels K as follows: (1) Gaussian kernels with 10 different widths ({2^{−3}, 2^{−2}, ..., 2^{6}}) on all features and on each single feature; (2) polynomial kernels of degree 1 to 3 on all features and on each single feature. Each base kernel matrix is normalized to unit trace. For each data set, we randomly sample 50% of all instances as training data and use the rest as test data. The training instances are normalized to zero mean and unit variance, and the test instances are normalized using the same mean and variance as the training data. To get stable results, for each data set we repeat each algorithm 20 times and report the average results of the 20 runs.

For comparison, we have tried our best to compare as many state-of-the-art MKL methods as possible, which were proposed under different contexts for various applications. The goal of our experiment is mainly to examine whether deep MKL is effective for improving the performance of shallow MKL techniques. Specifically, we compare the following algorithms:

SVM: The Support Vector Machine algorithm with a single Gaussian kernel. The bandwidth parameter is selected via 5-fold cross validation on the training data;

MKL^Level: The convex multiple kernel learning algorithm, whose target kernel class is K_conv defined in (3). We use the extended level method [29] to learn the kernel;

LpMKL: The MKL algorithm with L_p norm regularization over the kernel weights [16]. We adopt their cutting plane algorithm with a second-order Taylor approximation of L_p;

GMKL: The Generalized MKL algorithm in [28]. The target kernel class is the Hadamard product of single Gaussian kernels defined on each dimension;

IKL: The Infinite Kernel Learning algorithm proposed by [11]. We use MKL^Level as the embedded algorithm to solve for the kernel weights µ and α;

MKM: The Multilayer Kernel Machine with deep learning [6], which essentially trains an SVM with multilayer arc-cosine kernel functions;

2LMKL: The proposed Two-Layer MKL algorithm described in Algorithm 1;

2LMKL^Inf: The proposed Infinite Two-Layer MKL algorithm described in Algorithm 2.

For parameter settings, the regularization parameter C in the MKL and 2LMKL algorithms is determined by 5-fold cross validation on the training data over the range {10^{−2}, 10^{−1}, ..., 10^{2}}. For a fair comparison, the same set of base kernels is adopted by MKL^Level, LpMKL, and 2LMKL. For LpMKL, we examine p = 2, 3, 4 and report the best result. For fair comparison, in MKM we choose the number of layers l = 2 and find the best degree parameter ({0, 1, 2}) by cross validation. For 2LMKL^Inf, the initial base kernels are Gaussian kernels with 10 parameters and polynomial kernels of 3 degrees calculated on all the features. During the iterative process, we add one Gaussian kernel at each iteration.

^1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
^2 http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark
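As an illustration of the base-kernel construction described above, here is a short sketch (ours, simplified: kernels on all features only, omitting the per-feature variants) that builds the Gaussian and polynomial base kernels and normalizes each to unit trace.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_base_kernels(X):
    """Gaussian kernels with widths 2^-3, ..., 2^6 and polynomial kernels of
    degree 1-3 on all features, each normalized to unit trace."""
    kernels = []
    sq = cdist(X, X, "sqeuclidean")
    for gamma in 2.0 ** np.arange(-3, 7):      # 10 Gaussian widths
        kernels.append(np.exp(-gamma * sq))
    for degree in (1, 2, 3):                    # polynomial kernels
        kernels.append((X @ X.T) ** degree)
    return [K / np.trace(K) for K in kernels]   # unit-trace normalization
```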


4.2 Performance Analysis

Table 2 shows the detailed results of average classification accuracy and standard deviation values. To compare the overall performance, we count the ranks of the algorithms according to their performance on each data set; the average rank is included in the last row. From the results, we can draw several observations.

First of all, by comparing the results of SVM and the four existing MKL methods, we find that the existing MKL algorithms do not always outperform SVM with an RBF kernel. For example, the MKL^Level algorithm outperformed SVM only over five data sets (Australian, Splice, FlareSolar, Banana, and Heart), and was surpassed significantly by SVM over several data sets (German, Sonar, Liver, etc.). Although this seems a little surprising, a similar observation was reported in a previous empirical study [11], which also found that regular MKL does not always outperform SVM with an RBF kernel whose kernel parameter was tuned by cross validation. This observation validates our motivation for overcoming the shallow limitation of regular MKL methods.

Second, by comparing the four existing MKL methods, we observe that IKL overall achieved the best performance among them, and GMKL tended to perform slightly worse than the other methods. Specifically, among all the MKL methods, IKL won three best cases and LpMKL won two best cases, while MKL^Level and GMKL each won only one out of 16 cases. We conjecture that IKL performed better probably because IKL can draw on a largely increased kernel set, which may be more flexible for the classification task. Similarly, GMKL may have performed worse due to its relatively smaller kernel set, in which the base kernel set consists of only d kernels, each defined on a single dimension.

Third, by examining the results achieved by the two proposed algorithms, 2LMKL and 2LMKL^Inf, we find that both of them achieved rather impressive performance. Both considerably outperformed the other methods, including the existing MKL methods and the MKM method, over quite a number of data sets. Specifically, among all the MKL methods, 2LMKL won five best cases while 2LMKL^Inf won 11 best cases out of 16. This encouraging performance shows that our 2LMKL method is more effective than the regular MKL methods through the exploration of deep kernel learning, and is also more effective than the previous MKM method with deep learning.

Finally, comparing the two proposed two-layer MKL algorithms themselves, we observe that 2LMKL^Inf performed better than 2LMKL. This validates the efficacy of the proposed improvement that exploits the idea of infinite base kernel learning.

5 Conclusion

This paper presented a general framework of multi-layer multiple kernel learning (MLMKL) to overcome the shallow learning nature of regular MKL. Under this framework, we proposed a Two-Layer Multiple Kernel Learning (2LMKL) method and developed two effective algorithms to solve it. We analyzed the generalization risk of the proposed two-layer MKL algorithms and conducted an extensive set of experiments. Our empirical results showed that the proposed 2LMKL algorithms usually perform better than the existing shallow MKL methods, demonstrating the efficacy of the 2LMKL approach.

Despite the promising results, MLMKL remains a rather new area for future research. In our future work, we plan to extend the current two-layer MKL scheme to higher-layer MKL solutions to further enhance its efficacy. Akin to the training scheme of deep learning [13], one can learn µ^{(l)} in a bottom-up and layer-wise manner. In other words, we can embed the learned kernel k_s^{(l)} (the summation of base kernels at the current layer) into the base kernel functions to generate the candidate kernels k_1^{(l+1)}, ..., k_n^{(l+1)} for the next layer. Then conventional MKL algorithms are adopted to solve for the weights µ^{(l+1)}. This process can be repeated to construct deep kernels. We will also analyze the overfitting issue for MLMKL and investigate more theoretical insights into the power of multi-layer multiple kernel learning.

Acknowledgments

This research was in part supported by Singapore A* SERC Grant (102 158 0034), MOE Tier-1 Grant (RG15/08), Tier-1 Grant (RG67/07) and Tier-2 Grant (T208B2203).


References

[1] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[2] A. Argyriou, C. A. Micchelli, and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. In COLT, pages 338–352, 2005.
[3] F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, pages 105–112, 2008.
[4] F. Bach, G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[5] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[6] Y. Cho and L. K. Saul. Kernel methods for deep learning. In NIPS, pages 342–350, 2009.
[7] Y. Cho and L. K. Saul. Large-margin classification in infinite neural networks. Neural Computation, 22(10):2678–2697, 2010.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, 2009.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[10] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, New York, NY, USA, 2000.
[11] P. V. Gehler and S. Nowozin. Infinite kernel learning. Technical Report TR-178, Max Planck Institute for Biological Cybernetics, 2008.
[12] M. Gönen and E. Alpaydin. Localized multiple kernel learning. In ICML, pages 352–359, 2008.
[13] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[14] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[15] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. The Annals of Statistics, 36(3):1171–1220, 2008.
[16] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate lp-norm multiple kernel learning. In NIPS, 2009.
[17] J. T. Kwok and I. W. Tsang. The pre-image problem in kernel methods. IEEE Transactions on Neural Networks, 15(6):1517–1525, 2004.
[18] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[19] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473–480, 2007.
[20] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In NIPS, pages 536–542, 1998.
[21] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efficiency in multiple kernel learning. In ICML, pages 775–782, 2007.
[22] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In COLT/EuroCOLT, pages 416–426, 2001.
[23] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
[24] S. Sonnenburg, G. Rätsch, and C. Schäfer. A general and efficient multiple kernel learning algorithm. In NIPS, 2005.
[25] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
[26] N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, pages 169–183, 2006.
[27] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[28] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In ICML, page 134, 2009.
[29] Z. Xu, R. Jin, I. King, and M. R. Lyu. An extended level method for efficient multiple kernel learning. In NIPS, pages 1825–1832, 2008.
[30] Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In COLT, 2009.
[31] Y. Ying and C. Campbell. Rademacher chaos complexity for learning the kernel problem. Technical report, http://secamlocal.ex.ac.uk/people/staff/yy267/KLbound-version3.pdf, 2010.
[32] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. In NIPS, pages 1081–1088, 2001.
