Simple and Efficient Multiple Kernel Learning by Group Lasso

Zenglin Xu [email protected] Cluster of Excellence MMCI, Saarland University and MPI Informatics, Saarbruecken, Germany

Rong Jin [email protected] Department of Computer Science & Engineering, Michigan State University, East Lansing, MI 48824 USA

Haiqin Yang, Irwin King, Michael R. Lyu {HQYANG, KING, LYU}@CSE.CUHK.EDU.HK Department of Computer Science & Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong

Abstract

We consider the problem of how to improve the efficiency of Multiple Kernel Learning (MKL). In the literature, MKL is often solved by an alternating approach: (1) the minimization over the kernel weights is solved by complicated techniques, such as Semi-infinite Linear Programming, Subgradient Descent, or the Level method; (2) the maximization over the SVM dual variables can be solved by standard SVM solvers. However, the minimization step in these methods usually depends on specialized solving techniques or commercial software, which limits their efficiency and applicability. In this paper, we formulate a closed-form solution for optimizing the kernel weights based on the equivalence between group lasso and MKL. Although this equivalence is not our invention, our derived variant of the equivalence not only leads to an efficient algorithm for MKL, but also generalizes to Lp-MKL (p >= 1, denoting the Lp-norm of the kernel weights). Therefore, our proposed algorithm provides a unified solution for the entire family of Lp-MKL models. Experiments on multiple data sets show the promising performance of the proposed technique compared with other competitive methods.

1. Introduction

Multiple kernel learning (MKL) has been an attractive topic in machine learning (Lanckriet et al., 2004b; Ong et al., 2005). It has been regarded as a promising technique for identifying the combination of multiple data sources or feature subsets, and has been applied in a number of domains, such as genome fusion (Lanckriet et al., 2004a), splice site detection (Sonnenburg et al., 2006), image annotation (Harchaoui & Bach, 2007), and so on.

Multiple kernel learning searches for a combination of base kernel functions/matrices that maximizes a generalized performance measure. Typical measures studied for multiple kernel learning include maximum margin classification errors (Lanckriet et al., 2004b; Bach et al., 2004; Argyriou et al., 2006; Zien & Ong, 2007), kernel-target alignment (Cristianini et al., 2001), Fisher discriminative analysis (Ye et al., 2007), etc.

There are two active research directions in multiple kernel learning. One is to improve the efficiency of MKL algorithms. Following the Semi-Definite Programming (SDP) algorithm proposed in the seminal work of (Lanckriet et al., 2004b), a block-norm regularization method based on Second Order Cone Programming (SOCP) was proposed in (Bach et al., 2004) in order to solve medium-scale problems. Due to the high computational cost of SDP and SOCP, these methods cannot handle larger numbers of kernels or training examples. Recent studies suggest that alternating approaches (Sonnenburg et al., 2006; Rakotomamonjy et al., 2008; Xu et al., 2009a) are more efficient. These approaches alternate between the optimization of the kernel weights and the optimization of the SVM classifier: in each step, given the current kernel weights, a classical SVM is solved with the combined kernel, and then a specific procedure is used to update the kernel weights. The advantage of the alternating scheme is that SVM solvers can be very efficient due to recent advances in large scale optimization (Bottou & Lin, 2007). However, the current approaches for updating the kernel weights are still time consuming and most of them depend on commercial software. For example, the well-known Shogun toolbox for MKL (www.shogun-toolbox.org/) employs CPLEX for solving the linear programming problem.

The second direction is to improve the accuracy of MKL by exploring possible ways of combining the base kernels. The L1-norm of the kernel weights, also known as the simplex constraint, is most commonly used in MKL methods. The advantage of the simplex constraint is that it leads to a sparse solution, i.e., only a few base kernels among many carry significant weights. However, as argued in (Kloft et al., 2008), the simplex constraint may discard complementary information when the base kernels encode orthogonal information, leading to suboptimal performance. To improve the accuracy in this scenario, an L2-norm of the kernel weights, known as a ball constraint, is introduced in their work. A natural extension of the L2-norm is the Lp-norm, which is approximated by a second order Taylor expansion and therefore leads to a convex optimization problem (Kloft et al., 2009). Another possible extension is to exploit grouping structure or a mixed-norm combination, which is helpful when there are principal components among the base kernels (Saketha Nath et al., 2009; Szafranski et al., 2008). Other researchers also study non-linear combinations of kernels (Varma & Babu, 2009; Cortes et al., 2009b). Although the solution space is enlarged, a non-linear combination usually results in a non-convex optimization problem, leading to an even higher computational cost. Moreover, the solution of a nonlinear combination is difficult to interpret. It is important to note that the choice of combination usually depends on the composition of the base kernels, and a non-linear combination is not necessarily superior to the traditional L1-norm combination.

To this end, we first derive a variant of the equivalence between MKL and group lasso (Yuan & Lin, 2006). Based on this equivalence, we transform the convex-concave optimization of MKL into a joint minimization problem. We further obtain a closed-form solution for the kernel weights, which greatly improves the computational efficiency. It should be noted that although the consistency between MKL and group lasso is discussed in (Rakotomamonjy et al., 2008; Bach, 2008), our variational equivalence leads to a stable optimization algorithm. On the other hand, the proposed optimization algorithm could also be instructive for the optimization of group lasso. We further show that the Lp-norm formulation of MKL is equivalent to an optimization problem with a different group regularizer, which has not appeared in the literature. Compared to the approach in (Kloft et al., 2009), our approach directly solves the related optimization problem without a Taylor approximation. Experimental results on multiple data sets show the promising performance of the proposed optimization method.

The rest of this paper is organized as follows. Section 2 presents related work on multiple kernel learning. Section 3 first presents the variational equivalence between MKL and group lasso, followed by the optimization method and its generalization to Lp-norm MKL. Section 4 shows the experimental results and Section 5 concludes this paper.

2. Related Work

Let X = (x_1, ..., x_n) in R^{n x d} denote the collection of n training samples that lie in a d-dimensional space, and let y = (y_1, y_2, ..., y_n) in {-1, +1}^n denote the binary class labels for the data points in X. Multiple kernel learning is often cast into the following optimization problem:

\[
\min_{f\in\mathcal{H}_\gamma}\ \frac{1}{2}\|f\|_{\mathcal{H}_\gamma}^2 + C\sum_{i=1}^n \ell\big(y_i f(\mathbf{x}_i)\big), \tag{1}
\]

where \(\mathcal{H}_\gamma\) is a reproducing kernel Hilbert space parameterized by \(\gamma\) and \(\ell(\cdot)\) is a loss function. \(\mathcal{H}_\gamma\) is endowed with the kernel function \(\kappa(\cdot,\cdot;\gamma)=\sum_{j=1}^m \gamma_j \kappa_j(\cdot,\cdot)\).

When the hinge loss is employed, the dual problem of MKL (Lanckriet et al., 2004b) is equivalent to:

\[
\min_{\gamma\in\Delta}\ \max_{\alpha\in Q}\ \mathbf{1}^\top\alpha - \frac{1}{2}(\alpha\circ\mathbf{y})^\top\Big(\sum_{j=1}^m \gamma_j K_j\Big)(\alpha\circ\mathbf{y}), \tag{2}
\]

where \(\Delta\) is the domain of \(\gamma\), \(Q\) is the domain of \(\alpha\), \(\mathbf{1}\) is a vector of all ones, \(\{K_j\}_{j=1}^m\) is the group of base kernel matrices associated with \(\{\mathcal{H}_j\}_{j=1}^m\), and \(\circ\) denotes the element-wise product between two vectors. The domain \(Q\) is usually defined as:

\[
Q = \{\alpha\in\mathbb{R}^n : \alpha^\top\mathbf{y}=0,\ 0\le\alpha\le C\}. \tag{3}
\]

It is interesting to discuss the domain of \(\gamma\). When \(\gamma\) lies in a simplex, i.e., \(\Delta = \{\gamma\in\mathbb{R}_+^m : \sum_{j=1}^m \gamma_j = 1,\ \gamma_j\ge 0\}\), we call it the L1-norm of kernel weights; most MKL methods fall into this category. Correspondingly, when \(\Delta = \{\gamma\in\mathbb{R}_+^m : \|\gamma\|_p\le 1,\ \gamma_j\ge 0\}\), we call it the Lp-norm of kernel weights and the resulting model Lp-MKL. A special case is L2-MKL (Kloft et al., 2008; Cortes et al., 2009a).

It is easy to verify that the overall optimization problem (2) is convex in \(\gamma\) and concave in \(\alpha\). Thus the above optimization problem can be regarded as a convex-concave problem.

Based on the convex-concave nature of the above problem, a number of algorithms have been proposed that alternate between the optimization of the kernel weights and the optimization of the SVM classifier. In each step, given the current solution of kernel weights, they solve a classical SVM with the combined kernel; then a specific procedure is used to update the kernel weights. For example, the Semi-Infinite Linear Programming (SILP) approach, developed in (Sonnenburg et al., 2006; Kloft et al., 2009), constructs a cutting plane model for the objective function and updates the kernel weights by solving a corresponding linear programming problem. Although the SILP approach can be employed for large scale MKL problems, it often suffers from slow convergence. In (Rakotomamonjy et al., 2008), the authors address the MKL problem by a simple Subgradient Descent (SD) method, known as simpleMKL. However, since the SD method is memoryless, it does not utilize the gradients computed in previous iterations, which could be very useful in speeding up the search. More recently, a level set method has been proposed in (Xu et al., 2009a;b) to combine cutting plane models with a regularization method based on a projection to level sets. Most alternating methods for MKL can be summarized in Algorithm 1.

Algorithm 1 Alternating approach for solving MKL
1: Initialize a valid solution \(\gamma^0 \in \Delta\) and t = 0
2: repeat
3:   Solve the dual of SVM with the kernel \(K = \sum_{j=1}^m \gamma_j^t K_j\) and obtain the optimal solution \(\alpha^t\)
4:   Update the kernel weights via cutting plane methods, subgradient descent, or level set projection
5:   Update t = t + 1
6: until Convergence

Despite the success of the alternating algorithms in a number of applications, they are usually inefficient in updating the kernel weights. More specifically, these methods require solving a linear programming sub-problem that usually resorts to commercial software, such as CPLEX and MOSEK, which limits their applicability. To alleviate the dependency on commercial software and to further accelerate the optimization, in the following section we present an alternative procedure for updating the kernel weights.
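To make the shared structure of these methods concrete, the following is a minimal Python sketch of the alternating loop in Algorithm 1. It assumes the base kernels are precomputed numpy matrices and uses scikit-learn's SVC with a precomputed kernel as the inner SVM solver; the function and parameter names (alternating_mkl, update_weights, max_iter) are our own illustrative choices, not part of any of the cited toolboxes.

```python
import numpy as np
from sklearn.svm import SVC

def alternating_mkl(kernels, y, update_weights, C=1.0, max_iter=100, tol=1e-6):
    """Generic alternating MKL loop (Algorithm 1): solve an SVM with the
    combined kernel, then let `update_weights` revise the kernel weights."""
    m = len(kernels)
    gamma = np.full(m, 1.0 / m)                              # a valid starting point in the simplex
    svm = None
    for _ in range(max_iter):
        K = sum(g * Kj for g, Kj in zip(gamma, kernels))     # K = sum_j gamma_j K_j
        svm = SVC(C=C, kernel="precomputed").fit(K, y)       # inner SVM on the combined kernel
        alpha_y = np.zeros(len(y))                           # (alpha o y), zero off the support set
        alpha_y[svm.support_] = svm.dual_coef_.ravel()
        new_gamma = update_weights(gamma, alpha_y, kernels)  # cutting plane / subgradient / level set
        if np.max(np.abs(new_gamma - gamma)) < tol:
            gamma = new_gamma
            break
        gamma = new_gamma
    return gamma, svm
```

The per-iteration cost is dominated by the SVM solve; the methods above differ only in how `update_weights` is realized.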

3. Efficient MKL Algorithms

Instead of casting the optimization of kernel weights into a complex linear programming problem, we derive a closed-form solution for updating the kernel weights based on an alternative formulation of MKL using group lasso. It is important to note that although the equivalence between MKL and group lasso is not our invention (Bach, 2008), the variational formulation derived in this work is more effective in motivating an efficient optimization algorithm for MKL. Furthermore, the alternative formulation leads to an equivalence between Lp-MKL (p >= 1) and a group regularizer.

3.1. Connection between MKL and Group Lasso

For convenience of presentation, we start from the setting of the L1-norm for kernel weights. To show the connection between MKL and group lasso, we first transform (1) by Theorem 1.

Theorem 1. The optimization problem (1) is equivalent to the following optimization problem:

\[
\min_{\gamma\in\Delta}\ \min_{\{f_j\in\mathcal{H}_j\}_{j=1}^m}\ \frac{1}{2}\sum_{j=1}^m \gamma_j\|f_j\|_{\mathcal{H}_j}^2 + C\sum_{i=1}^n \ell\Big(y_i\sum_{j=1}^m \gamma_j f_j(\mathbf{x}_i)\Big). \tag{4}
\]

Proof. It is important to note that the problem in (4) is non-convex, and therefore we cannot deploy the standard approach to convert (4) into its dual form. To transform (4) into (1), we first rewrite \(C\ell(z)=\max_{\alpha\in[0,C]}\alpha(1-z)\). Then the optimization problem (4) can be transformed as follows:

\[
\min_{\gamma\in\Delta}\ \min_{\{f_j\in\mathcal{H}_j\}_{j=1}^m}\ \max_{\alpha\in[0,C]^n}\ \frac{1}{2}\sum_{j=1}^m \gamma_j\|f_j\|_{\mathcal{H}_j}^2 + \sum_{i=1}^n \alpha_i\Big(1 - y_i\sum_{j=1}^m \gamma_j f_j(\mathbf{x}_i)\Big).
\]

Since the problem is convex in \(f_j\) and concave in \(\alpha\), we can switch the minimization over \(f_j\) with the maximization over \(\alpha\), i.e.,

\[
\min_{\gamma\in\Delta}\ \max_{\alpha\in[0,C]^n}\ \min_{\{f_j\in\mathcal{H}_j\}_{j=1}^m}\ \sum_{i=1}^n \alpha_i + \sum_{j=1}^m \gamma_j\Big(\frac{1}{2}\|f_j\|_{\mathcal{H}_j}^2 - \sum_{i=1}^n \alpha_i y_i f_j(\mathbf{x}_i)\Big).
\]

By taking the minimization over \(f_j\), we have

\[
f_j(\mathbf{x}) = \sum_{i=1}^n \alpha_i y_i \kappa_j(\mathbf{x}_i, \mathbf{x}). \tag{5}
\]

Then the resulting optimization problem becomes

\[
\min_{\gamma\in\Delta}\ \max_{\alpha\in[0,C]^n}\ \mathbf{1}^\top\alpha - \frac{1}{2}\sum_{j=1}^m \gamma_j(\alpha\circ\mathbf{y})^\top K_j(\alpha\circ\mathbf{y}),
\]

which is exactly the dual problem of (1) as shown in (2).

To further show the connection between group lasso and MKL, we need to decouple the interaction between the weights \(\gamma\) and the classification functions \(f_j\), j = 1, ..., m. Based on the result of Theorem 1, we define

\[
\tilde f_j = \gamma_j f_j. \tag{6}
\]

We can then rewrite the problem in (4) as

\[
\min_{\gamma\in\Delta}\ \min_{\{\tilde f_j\in\mathcal{H}_j\}_{j=1}^m}\ \frac{1}{2}\sum_{j=1}^m \frac{1}{\gamma_j}\|\tilde f_j\|_{\mathcal{H}_j}^2 + C\sum_{i=1}^n \ell\Big(y_i\sum_{j=1}^m \tilde f_j(\mathbf{x}_i)\Big). \tag{7}
\]

By taking the minimization over \(\gamma\) (minimizing \(\sum_j \|\tilde f_j\|_{\mathcal{H}_j}^2/\gamma_j\) over the simplex yields \(\gamma_j \propto \|\tilde f_j\|_{\mathcal{H}_j}\)), we obtain:

\[
\gamma_j = \frac{\|\tilde f_j\|_{\mathcal{H}_j}}{\sum_{k=1}^m \|\tilde f_k\|_{\mathcal{H}_k}},\quad j = 1, \ldots, m. \tag{8}
\]

Based on equations (5) and (6), \(\|\tilde f_j\|_{\mathcal{H}_j}\) can be calculated as:

\[
\|\tilde f_j\|_{\mathcal{H}_j}^2 = \gamma_j^2\,(\alpha\circ\mathbf{y})^\top K_j(\alpha\circ\mathbf{y}). \tag{9}
\]

The following theorem allows us to rewrite (4) in another form that is clearly related to group lasso.

Theorem 2. The optimization problem in (4) is equivalent to the following optimization problem

\[
\min_{\{f_j\in\mathcal{H}_j\}_{j=1}^m}\ \frac{1}{2}\Big(\sum_{j=1}^m \|f_j\|_{\mathcal{H}_j}\Big)^2 + C\sum_{i=1}^n \ell\Big(y_i\sum_{j=1}^m f_j(\mathbf{x}_i)\Big),
\]

where \(f_j \in \mathcal{H}_j\) for j = 1, ..., m.

The above theorem can be easily obtained by substituting equation (8) into the optimization problem (7).

Remark 1. The regularizer above can be regarded as a group-lasso regularizer for f (Bach, 2008). Theorem 2 states that the classical formulation of MKL essentially uses a group-lasso type regularizer, which is not obvious from the optimization problem in (1).

Remark 2. The formulation in (7) provides an alternative approach for solving MKL. Unlike the typical formulation that casts MKL into a convex-concave optimization problem, (7) casts MKL into a simple minimization problem. As a result, we can solve (7) by alternating optimization: optimize \(\gamma\) with fixed \(\tilde f_j\), j = 1, ..., m, and optimize \(\tilde f_j\), j = 1, ..., m, with fixed \(\gamma\). It is important to note that the calculation of \(\|\tilde f_j\|_{\mathcal{H}_j}\) has the same computational complexity as the subgradient calculation used in the simpleMKL (Rakotomamonjy et al., 2008) and Shogun (Sonnenburg et al., 2006) toolboxes. Therefore, due to the closed-form solution for \(\gamma_j\), the overall optimization can be very efficient compared with the complicated semi-infinite programming and subgradient descent used in those toolboxes.

We summarize the alternating optimization discussed above in Algorithm 2. It is important to note that the optimization problem (7) is jointly convex in \(\gamma\) and \(\tilde f_j\), and each step of the algorithm decreases the objective. Therefore Algorithm 2 ensures that the final solution converges to the global one; since the dual problem is convex, the optimality of the solution can be guaranteed.

Algorithm 2 The Group-Lasso Minimization for Multiple Kernel Learning
1: Initialize \(\gamma^0 = 1/m\)
2: repeat
3:   Solve the dual problem of SVM with \(K = \sum_{j=1}^m \gamma_j K_j\) to obtain the optimal solution \(\alpha\)
4:   Calculate \(\|\tilde f_j\|_{\mathcal{H}_j}\) and \(\gamma_j\) according to (9) and (8), respectively
5: until Convergence

Remark 3. Joint minimization techniques are also used in feature scaling (Grandvalet & Canu, 2002). As discussed in (Rakotomamonjy et al., 2008), the minimization over the scaling vector (similar to the kernel weights in our context) may suffer from instability, especially when some elements approach 0. We argue, however, that this is not the case for Algorithm 2: although \(1/\gamma_j\) appears in (7), which could lead to singularity when \(\gamma_j\) approaches 0, \(\gamma_j\) is updated by (8), which does not have any source of singularity. This is further confirmed by our empirical study.
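As a concrete instance, the closed-form update of Algorithm 2 amounts to the following weight-update rule, a sketch that plugs into the generic loop given after Algorithm 1; it computes \(\|\tilde f_j\|_{\mathcal{H}_j}\) via eq. (9) and renormalizes via eq. (8). The function name is ours.

```python
import numpy as np

def group_lasso_update(gamma, alpha_y, kernels):
    """MKLGL update: eq. (9) gives ||f~_j||_{H_j}; eq. (8) renormalizes to the simplex."""
    norms = np.array([g * np.sqrt(alpha_y @ Kj @ alpha_y)   # gamma_j * sqrt((a o y)^T K_j (a o y))
                      for g, Kj in zip(gamma, kernels)])
    return norms / norms.sum()
```

With this rule, alternating_mkl(kernels, y, group_lasso_update) reproduces the structure of Algorithm 2, up to the initialization and stopping details.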

3.2. Generalization to Lp-MKL

Given the equivalence between group lasso and L1-MKL, we now generalize the formulation of MKL by replacing the constraint \(\sum_{j=1}^m \gamma_j \le 1\) with \(\sum_{j=1}^m \gamma_j^p \le 1\) for p > 0.

Theorem 3. The MKL problem with the general constraint \(\|\gamma\|_p \le 1\), where p >= 1, i.e.,

\[
\min_{\|\gamma\|_p\le 1}\ \min_{f\in\mathcal{H}_\gamma}\ \frac{1}{2}\|f\|_{\mathcal{H}_\gamma}^2 + C\sum_{i=1}^n \ell\big(y_i f(\mathbf{x}_i)\big),
\]

is equivalent to the following optimization problem

\[
\min_{\{f_j\in\mathcal{H}_j\}_{j=1}^m}\ \frac{1}{2}\Big(\sum_{j=1}^m \|f_j\|_{\mathcal{H}_j}^q\Big)^{\frac{2}{q}} + C\sum_{i=1}^n \ell\Big(y_i\sum_{j=1}^m f_j(\mathbf{x}_i)\Big),
\]

with \(q = \frac{2p}{1+p}\).

Proof. Similar to the proof of Theorem 1, we can rewrite the generalized MKL, which is originally of the form

\[
\min_{\|\gamma\|_p\le 1}\ \min_{f\in\mathcal{H}_\gamma}\ \frac{1}{2}\|f\|_{\mathcal{H}_\gamma}^2 + C\sum_{i=1}^n \ell\big(y_i f(\mathbf{x}_i)\big),
\]

into the following form

\[
\min_{\|\gamma\|_p\le 1}\ \min_{\{\tilde f_j\in\mathcal{H}_j\}_{j=1}^m}\ \frac{1}{2}\sum_{j=1}^m \frac{1}{\gamma_j}\|\tilde f_j\|_{\mathcal{H}_j}^2 + C\sum_{i=1}^n \ell\Big(y_i\sum_{j=1}^m \tilde f_j(\mathbf{x}_i)\Big). \tag{10}
\]

By taking the minimization over \(\gamma\), we have

\[
\gamma_j = \frac{\|\tilde f_j\|_{\mathcal{H}_j}^{\frac{2}{1+p}}}{\Big(\sum_{k=1}^m \|\tilde f_k\|_{\mathcal{H}_k}^{\frac{2p}{1+p}}\Big)^{\frac{1}{p}}}, \tag{11}
\]

and therefore

\[
\sum_{j=1}^m \frac{1}{\gamma_j}\|\tilde f_j\|_{\mathcal{H}_j}^2
= \sum_{j=1}^m \|\tilde f_j\|_{\mathcal{H}_j}^{\frac{2p}{1+p}}\Big(\sum_{k=1}^m \|\tilde f_k\|_{\mathcal{H}_k}^{\frac{2p}{1+p}}\Big)^{\frac{1}{p}}
= \Big(\sum_{k=1}^m \|\tilde f_k\|_{\mathcal{H}_k}^{\frac{2p}{1+p}}\Big)^{\frac{p+1}{p}}.
\]

Defining \(q = \frac{2p}{1+p}\), with the constraint q <= 2, we obtain the regularizer \(\big(\sum_{k=1}^m \|f_k\|_{\mathcal{H}_k}^q\big)^{\frac{2}{q}}\), which leads to the result in the theorem.

It is important to note that when mapping the original MKL formulation to an Lq regularizer, q is in the range between 1 and 2. However, using the regularizer framework, q is allowed to take any value greater than or equal to 1. Lp-MKL can also be solved by Algorithm 2 in a similar manner.

4. Experiments

In this section, we conduct experiments with two objectives: (1) to evaluate the efficiency of the proposed MKL algorithm, and (2) to evaluate the effectiveness of Lp-MKL when varying the value of p. Although the scalability of the algorithm is also of interest, we do not focus on this aspect in the current version of this paper.

We denote by MKLGL our proposed MKL algorithm based on group lasso. We compare MKLGL with several state-of-the-art efficient MKL algorithms, including: Semi-infinite Programming (SIP) (Kloft et al., 2009; Sonnenburg et al., 2006), the subgradient descent approach used in the SimpleMKL toolbox (Rakotomamonjy et al., 2008), and the level method (Xu et al., 2009a). It is important to note that SIP involves sub-problems of linear programming (for L1-MKL) or of second order cone programming (for Lp-MKL with p >= 2), and the level method involves sub-problems of quadratic programming. For all of the above convex programming problems, we adopt two implementations: the CVX package (www.stanford.edu/ boyd/cvx) and the MOSEK package (http://www.mosek.com).

4.1. Comparison with SimpleMKL

We randomly select four UCI datasets, i.e., "iono", "breast", "sonar", and "pima", to evaluate the efficiency of the proposed MKL algorithm. We adopt the following settings from (Rakotomamonjy et al., 2008; Xu et al., 2009a) to construct the base kernel matrices (a construction sketch is given below):

• Gaussian kernels with 10 different widths ({2^{-3}, 2^{-2}, ..., 2^6}) on all features and on each single feature;
• Polynomial kernels of degree 1 to 3 on all features and on each single feature.

Each base kernel matrix is normalized to unit trace. We repeat all the algorithms 20 times for each data set. In each run, 80% of the examples are randomly selected as the training data and the remaining data are used for testing. The training data are normalized to have zero mean and unit variance, and the test data are then normalized using the mean and variance of the training data. The regularization parameter C in the SVM is chosen by cross-validation.
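For reference, here is a minimal sketch of the base-kernel construction described above. The exact Gaussian parameterization (whether the listed widths enter as a bandwidth or as its inverse) and the polynomial offset are assumptions on our part, and the helper names are hypothetical.

```python
import numpy as np

def unit_trace(K):
    """Normalize a kernel matrix to unit trace, as described in the text."""
    return K / np.trace(K)

def build_base_kernels(X):
    """Gaussian kernels with widths 2^-3..2^6 and polynomial kernels of degree 1-3,
    each computed on all features and on every single feature."""
    kernels = []
    views = [X] + [X[:, [j]] for j in range(X.shape[1])]      # all features, then each feature alone
    for V in views:
        sq_dist = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
        for width in 2.0 ** np.arange(-3, 7):                 # widths 2^-3, ..., 2^6
            kernels.append(unit_trace(np.exp(-sq_dist / (2.0 * width ** 2))))
        for degree in (1, 2, 3):                              # polynomial kernels
            kernels.append(unit_trace((V @ V.T + 1.0) ** degree))
    return kernels
```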
The duality gap defined in (Rakotomamonjy et al., 2008) is used as the stopping criterion for all algorithms. Other settings are the same as in the simpleMKL toolbox.

We report the results of simpleMKL and MKLGL in Table 1. It can be observed that MKLGL greatly reduces the training time of simpleMKL. We omit the results of the SIP and level methods since both of them are significantly slower than MKLGL regardless of which optimization package, MOSEK or CVX, is employed. Generally, the MOSEK implementation is more efficient than the CVX implementation for both algorithms. Taking the "Iono" dataset as an example, under the MOSEK implementation the SIP method spends 12 times the training time of MKLGL, and under the CVX implementation the factor is 20. For the level method, its computational cost is 2 times that of MKLGL under the MOSEK implementation, and more than 5 times under the CVX implementation.

Table 1. The performance comparison of the simpleMKL and MKLGL algorithms. Here n and m denote the number of training samples and the number of kernels, respectively.

                              Acc. (%)     Time(s)        #SVM
Iono (n = 280, m = 442)
  SimpleMKL                   91.5±3.0     79.9±13.8      826.8±220.9
  MKLGL                       92.0±2.9     12.0±2.2       72.1±19.4
Breast (n = 546, m = 117)
  SimpleMKL                   96.5±1.2     110.5±18.6     542.0±131.1
  MKLGL                       96.6±1.2     14.1±1.9       40.0±8.2
Sonar (n = 166, m = 793)
  SimpleMKL                   82.0±6.5     57.0±4.6       687.6±135.4
  MKLGL                       82.0±6.8     5.7±1.1        53.6±16.1
Pima (n = 614, m = 117)
  SimpleMKL                   73.4±2.3     94.5±18.1      294.3±67.4
  MKLGL                       73.5±2.5     15.1±1.9       15.1±1.9

To better understand the properties of the algorithms, we plot the evolution of the objective value and of the kernel weights in Figure 1 and Figure 2, respectively. First, from Figure 1, we observe that the MKLGL algorithm converges substantially faster than simpleMKL. This is because MKLGL uses a closed-form update for the kernel weights, which has the same cost as the gradient computation in simpleMKL, whereas simpleMKL additionally requires a large number of function evaluations in order to compute the optimal stepsize via a line search. Note that every function evaluation in the line search of simpleMKL requires solving an SVM problem. This phenomenon can also be observed from the 4th column of Table 1. Second, to understand the evolution of the kernel weights (i.e., γ), we plot the evolution curves of the five largest kernel weights in Figure 2. We observe that the values of γ computed by simpleMKL fluctuate significantly across iterations. In contrast, for the proposed MKLGL algorithm, the values of γ change smoothly across iterations. The instability phenomenon discussed in (Rakotomamonjy et al., 2008) does not appear in our case.

4.2. Experiment on Lp-MKL

In this section, we examine scenarios where Lp-MKL is more effective than L1-MKL. Although Lp-MKL usually gives up sparsity when p >= 2, it helps keep the complementary information among the base kernels.

To see this, we design a synthesized dataset with 400 data points and 34 features satisfying y = sign(X̄), where X̄ is the mean of the randomly generated data X. We employ 50% of the data for training and the rest for testing. For kernel composition, each kernel matrix is defined as a linear kernel on a single feature. Therefore, the kernels are complementary to each other. The other settings are similar to the previous experiments.

To see how the value of p affects the classification accuracy, we vary p from 2 to 1000. We show the accuracy and computational time in Table 2. We also show the results of Lp-MKL with the SIP implementation, denoted MKLSIP, and the result of L1-MKL with the simpleMKL implementation. As a baseline, the result of SVM using the summation of the base kernels, denoted by sumKer, is also reported.

Table 2. The performance comparison of the Lp-MKL algorithms on the simulation data.

Norm      Algo.       Accuracy (%)   Time(s)
--        sumKer      80.4±2.8       1.9±0.1
p=1       SimpleMKL   83.4±2.0       9.2±3.6
p=2       MKLGL       89.7±1.2       3.1±0.5
          MKLSIP      89.7±1.3       42.2±4.8
p=3       MKLGL       91.1±1.6       2.1±0.2
          MKLSIP      91.1±1.5       27.3±3.7
p=4       MKLGL       91.4±1.8       1.8±0.2
          MKLSIP      91.4±1.8       28.4±2.1
p=10      MKLGL       92.2±1.6       1.4±0.1
          MKLSIP      92.2±1.4       17.2±1.6
p=100     MKLGL       92.6±1.8       1.1±0.1
          MKLSIP      92.6±1.9       13.1±0.8
p=1000    MKLGL       92.8±2.0       0.8±0.0
          MKLSIP      92.8±2.0       6.8±0.1

It can be observed that a larger p leads to better accuracy, since the kernels are all complementary to each other. On the other hand, although MKLSIP achieves almost the same accuracy, it takes significantly more computational time than MKLGL. This experiment suggests that our proposed MKLGL algorithm with the Lp-norm can be both effective and efficient in the scenario where the base kernels are complementary to each other.

We further evaluate the proposed algorithm for Lp-MKL on three real-world datasets, i.e., "Bci", "Usps", and "Coil", from the benchmark repository (www.kyb.tuebingen.mpg.de/ssl-book/). The "Usps" and "Coil" datasets both have 1500 instances and 241 features, and "Bci" has 400 instances and 117 features. For kernel formation, we only adopt the base kernels that are calculated on all features. For each dataset, 50% of the data are used for training and the remaining data are used for testing. Each kernel matrix is normalized to unit trace. Other settings are the same as in the previous experiments. We report the accuracy of Lp-MKL for different settings of p in Table 3. For comparison, we also show the result of the summation of all base kernels.
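The Lp-MKL runs reported here only change the weight-update step of Algorithm 2 to eq. (11). A minimal sketch, reusing the norms \(\|\tilde f_j\|_{\mathcal{H}_j}\) computed as in eq. (9) (the function name is ours):

```python
import numpy as np

def lp_update(norms, p):
    """L_p-MKL weight update from eq. (11); reduces to eq. (8) when p = 1."""
    numer = norms ** (2.0 / (1.0 + p))
    denom = np.sum(norms ** (2.0 * p / (1.0 + p))) ** (1.0 / p)
    return numer / denom
```

By construction, the returned weights satisfy the constraint ||γ||_p = 1.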
[Figure 1. Evolution of objective values over time (seconds) for the datasets "Iono", "Breast", "Sonar", and "Pima" (panels (a)-(d)), comparing simpleMKL and MKLGL. The objective values are plotted on a logarithmic scale (base 10) for better comparison.]

[Figure 2. The evolution curves of the five largest kernel weights (γ values versus iteration) for the datasets "Iono", "Breast", "Sonar", and "Pima", computed by the two MKL algorithms (simpleMKL and MKLGL).]

Note that, due to the space limit, we do not include the results of MKLSIP, as it obtains almost the same accuracy as MKLGL but takes more time. From Table 3, we observe that Lp-MKL (p >= 2) improves the classification accuracy compared to L1-MKL. Similar to the toy experiment, we find that a larger p corresponds to a better classification accuracy.

Table 3. The accuracy of the MKLGL algorithm when varying the value of p on the real-world datasets.

Algo.            Bci          Usps         Coil
SumKer           52.9±3.8     80.5±1.1     78.1±2.1
SimpleMKL        55.1±5.8     87.2±1.3     84.8±1.5
MKLGL  p=2       66.4±3.6     90.2±0.9     88.9±1.2
       p=3       68.0±3.0     91.8±0.7     90.8±1.3
       p=4       68.6±3.0     92.3±0.7     91.6±1.3
       p=10      69.4±3.2     93.3±0.6     93.2±1.1
       p=100     70.0±3.4     93.7±0.6     94.2±1.2
       p=1000    70.1±3.4     93.7±0.6     94.3±1.2

In summary, our proposed MKLGL is an efficient algorithm for solving Lp-MKL that usually leads to better performance than L1-MKL.

5. Conclusion

In this paper, we have presented an efficient algorithm for multiple kernel learning by exploiting the connection between MKL and the group-lasso regularizer. We calculate the kernel weights by a closed-form formulation, which removes the dependency of previous algorithms on complicated or commercial optimization software. Moreover, in order to improve the accuracy of traditional MKL methods, we naturally generalize MKL to Lp-MKL, which constrains the p-norm (p >= 1) of the kernel weights. We show that our algorithm can be applied to the whole family of Lp-MKL models for all p >= 1 without an extra Taylor approximation. Experimental results on a number of benchmark datasets indicate the promising performance of the proposed algorithm.

For future work, we plan to employ our proposed algorithm in real-world applications (http://mkl.ucsd.edu). In addition, it would be meaningful to find approaches that automatically determine the value of p in Lp-MKL. It is also desirable to derive MKL models which have sparse solutions and yet keep the complementary or other structural information in the data.

Acknowledgement

We thank the anonymous reviewers for their valuable comments. The work was supported by the National Science Foundation (IIS-0643494), the National Institutes of Health (1R01GM079688-01), and the Research Grants Council of Hong Kong (CUHK4158/08E and CUHK4128/08E).

References

Argyriou, Andreas, Hauser, Raphael, Micchelli, Charles A., and Pontil, Massimiliano. A DC-programming algorithm for kernel selection. In ICML '06: Proc. of the 23rd International Conference on Machine Learning, pp. 41–48, 2006.

Bach, Francis R. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

Bach, Francis R., Lanckriet, Gert R. G., and Jordan, Michael I. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML '04: Proc. of the 21st International Conference on Machine Learning, pp. 41–48, 2004.

Bottou, Léon and Lin, Chih-Jen. Support vector machine solvers. In Bottou, Léon, Chapelle, Olivier, DeCoste, Dennis, and Weston, Jason (eds.), Large Scale Kernel Machines, pp. 1–28. 2007.

Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. L2 regularization for learning kernels. In UAI '09: Proc. of the 25th Conference on Uncertainty in Artificial Intelligence, pp. 187–196, 2009a.

Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems 22, pp. 396–404. 2009b.

Cristianini, Nello, Shawe-Taylor, John, Elisseeff, André, and Kandola, Jaz S. On kernel-target alignment. In Neural Information Processing Systems (NIPS 13), pp. 367–373, 2001.

Grandvalet, Yves and Canu, Stéphane. Adaptive scaling for feature selection in SVMs. In NIPS, pp. 553–560, 2002.

Harchaoui, Zaid and Bach, Francis. Image classification with segmentation graph kernels. In Proc. Computer Vision and Pattern Recognition (CVPR), pp. 1–8, 2007.

Kloft, Marius, Brefeld, Ulf, Laskov, Pavel, and Sonnenburg, Sören. Non-sparse multiple kernel learning. In NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.

Kloft, Marius, Brefeld, Ulf, Sonnenburg, Sören, Laskov, Pavel, Müller, Klaus-Robert, and Zien, Alexander. Efficient and accurate lp-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22 (NIPS), pp. 997–1005. 2009.

Lanckriet, Gert R. G., Bie, Tijl De, Cristianini, Nello, Jordan, Michael I., and Noble, William Stafford. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, 2004a.

Lanckriet, Gert R. G., Cristianini, Nello, Bartlett, Peter, Ghaoui, Laurent El, and Jordan, Michael I. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004b.

Ong, Cheng Soon, Smola, Alexander J., and Williamson, Robert C. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005.

Rakotomamonjy, Alain, Bach, Francis R., Canu, Stéphane, and Grandvalet, Yves. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

Saketha Nath, J., Dinesh, G., Raman, S., Bhattacharyya, C., Ben-Tal, A., and Ramakrishnan, K. R. On the algorithmics and applications of a mixed-norm based kernel learning formulation. In Advances in Neural Information Processing Systems 22, pp. 844–852. 2009.

Sonnenburg, Sören, Rätsch, Gunnar, Schäfer, Christin, and Schölkopf, Bernhard. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

Szafranski, Marie, Grandvalet, Yves, and Rakotomamonjy, Alain. Composite kernel learning. In ICML '08: Proc. of the 25th International Conference on Machine Learning, pp. 1040–1047, 2008.

Varma, Manik and Babu, Bodla Rakesh. More generality in efficient multiple kernel learning. In ICML '09: Proc. of the 26th Annual International Conference on Machine Learning, pp. 1065–1072, 2009.

Xu, Zenglin, Jin, Rong, King, Irwin, and Lyu, Michael. An extended level method for efficient multiple kernel learning. In Advances in Neural Information Processing Systems 21 (NIPS), pp. 1825–1832. 2009a.

Xu, Zenglin, Jin, Rong, Lyu, Michael R., and King, Irwin. Discriminative semi-supervised feature selection via manifold regularization. In IJCAI '09: Proc. of the 21st International Joint Conference on Artificial Intelligence, pp. 1303–1308, 2009b.

Ye, Jieping, Chen, Jianhui, and Ji, Shuiwang. Discriminant kernel and regularization parameter learning via semidefinite programming. In ICML '07: Proc. of the 24th International Conference on Machine Learning, pp. 1095–1102, 2007.

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

Zien, Alexander and Ong, Cheng Soon. Multiclass multiple kernel learning. In ICML '07: Proc. of the 24th International Conference on Machine Learning, pp. 1191–1198, 2007.