Bayesian Efficient Multiple Kernel Learning

Mehmet Gönen [email protected]
Helsinki Institute for Information Technology HIIT
Department of Information and Computer Science, Aalto University School of Science

Abstract

Multiple kernel learning algorithms are proposed to combine kernels in order to obtain a better similarity measure or to integrate feature representations coming from different data sources. Most of the previous research on such methods is focused on the computational efficiency issue. However, it is still not feasible to combine many kernels using existing Bayesian approaches due to their high time complexity. We propose a fully conjugate Bayesian formulation and derive a deterministic variational approximation, which allows us to combine hundreds or thousands of kernels very efficiently. We briefly explain how the proposed method can be extended for multiclass learning and semi-supervised learning. Experiments with large numbers of kernels on benchmark data sets show that our inference method is quite fast, requiring less than a minute. On one bioinformatics and three image recognition data sets, our method outperforms previously reported results with better generalization performance.

1. Introduction

The main idea of kernel-based algorithms is to learn a linear decision function in the feature space where data points are implicitly mapped to using a kernel function (Vapnik, 1998). Given a sample of N independent and identically distributed training instances {xi ∈ X}_{i=1}^{N}, the decision function that is used to predict the target output of an unseen test instance x⋆ can be written as

\[
f(x_\star) = a^\top k_\star + b
\tag{1}
\]

where the vector of weights assigned to each training data point and the bias are denoted by a and b, respectively, and k⋆ = [k(x1, x⋆) ... k(xN, x⋆)]⊤, where k : X × X → R is the kernel function that calculates a similarity measure between two data points. Using the theory of structural risk minimization, the model parameters can be found by solving a quadratic programming problem, known as the support vector machine (SVM) (Vapnik, 1998). The model parameters can also be interpreted as random variables to obtain a Bayesian interpretation of the model, known as the relevance vector machine (RVM) (Tipping, 2001).

Kernel selection (i.e., choosing a functional form and its parameters) is the most important issue that affects the empirical performance of kernel-based algorithms and is usually done using a cross-validation procedure. Multiple kernel learning (MKL) methods have been proposed to make use of multiple kernels simultaneously instead of selecting a single kernel (see the recent survey by Gönen & Alpaydın (2011)). Such methods also provide a principled way of integrating feature representations coming from different data sources or modalities. Most of the previous research is focused on developing efficient MKL algorithms. Nevertheless, existing Bayesian MKL methods are problematic in terms of computation time when combining hundreds or thousands of kernels. In this paper, we formulate a very efficient Bayesian MKL method that solves this issue by formulating the combination in a novel way.

In Section 2, we give an overview of the related work by considering existing discriminative and Bayesian MKL algorithms. Section 3 gives the details of the proposed fully conjugate Bayesian formulation, called Bayesian efficient multiple kernel learning (BEMKL). In Section 4, we explain detailed derivations of our deterministic variational approximation for binary classification. Extensions towards multiclass learning and semi-supervised learning are summarized in Section 5. Section 6 evaluates BEMKL with large numbers of kernels on standard benchmark data sets in terms of time complexity, and reports classification results on one bioinformatics and three image recognition tasks, which are frequently used to compare MKL methods.
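To make the decision function in (1) concrete, here is a minimal sketch (ours, not part of the paper) that evaluates f(x⋆) = a⊤k⋆ + b for a single test point; the Gaussian kernel and all variable names are illustrative choices, and in practice a and b come from training.

```python
import numpy as np

def gaussian_kernel_vector(X_train, x_star, sigma=1.0):
    """Kernel values k_star[i] = k(x_i, x_star) for a Gaussian kernel."""
    sq_dists = np.sum((X_train - x_star) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def decision_function(a, b, X_train, x_star, sigma=1.0):
    """Evaluate f(x_star) = a^T k_star + b as in Eq. (1)."""
    k_star = gaussian_kernel_vector(X_train, x_star, sigma)
    return a @ k_star + b

# toy usage with random weights (in practice a and b come from training)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))
a, b = rng.normal(size=5), 0.1
print(decision_function(a, b, X_train, X_train[0]))
```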

2. Related Work

MKL algorithms basically replace the kernel in (1) with a combined kernel calculated as a function of the input kernels. The most common combination is to use a weighted sum of P kernels {km : X × X → R}_{m=1}^{P}:

\[
f(x_\star) = a^\top \underbrace{\left( \sum_{m=1}^{P} e_m k_{m,\star} \right)}_{k_{e,\star}} + b
\]

where the vector of kernel weights is denoted by e and k_{m,⋆} = [km(x1, x⋆) ... km(xN, x⋆)]⊤. Existing MKL algorithms with a weighted sum differ in the way that they formulate restrictions on the kernel weights: arbitrary weights (i.e., linear sum), nonnegative weights (i.e., conic sum), or weights on a simplex (i.e., convex sum).

Bach et al. (2004) formulate the problem as a second-order cone programming (SOCP) problem, which was formulated as a semidefinite programming problem previously by Lanckriet et al. (2004). However, SOCP can only be solved efficiently for medium-scale problems. Sonnenburg et al. (2006) reinterpret the problem as a semi-infinite linear programming (SILP) problem, which can be applied to large-scale data sets. Rakotomamonjy et al. (2008) develop a simple MKL algorithm using a sub-gradient descent (SD) approach, which is faster than the SILP method. Later, Xu et al. (2009) extend the level method, which is originally designed for optimizing non-smooth objective functions, to obtain a very efficient MKL algorithm that carries flavors from both SILP and SD approaches but outperforms them in terms of computation time.

The aforementioned methods tend to produce sparse kernel combinations, which corresponds to using the ℓ1-norm on the kernel weights. Sparsity at the kernel level may harm the generalization performance of the learner, and using non-sparse kernel combinations (e.g., the ℓ2-norm) may be a better choice (Cortes et al., 2009). Varma & Babu (2009) propose a generalized MKL algorithm that can use any differentiable and continuous regularization term on the kernel weights. This also allows us to integrate prior knowledge about the kernels into the model. Xu et al. (2010) and Kloft et al. (2011) independently and in parallel develop an MKL algorithm with the ℓp-norm (p ≥ 1) on the kernel weights. This method has a closed-form update rule for the kernel weights and requires only an SVM solver for optimization. The sequential minimal optimization (SMO) algorithm is the most commonly used method for solving SVM problems and efficiently scales to large problems. Vishwanathan et al. (2010) propose a very efficient method, called SMO-MKL, to train ℓp-norm (p > 1) MKL models with the squared norm as the regularization term, using the SMO algorithm for solving the MKL problem directly instead of solving intermediate SVMs at each iteration.

Most of the discriminative MKL algorithms are developed for binary classification. One-versus-all or one-versus-other strategies can be employed to get multiclass learners. However, there are also some direct formulations for multiclass learning. Zien & Ong (2007) give a multiclass MKL algorithm by formulating the problem as an SILP and show that their method is equivalent to multiclass generalizations of Bach et al. (2004) and Sonnenburg et al. (2006). Gehler & Nowozin (2009) propose a boosting-type MKL algorithm that combines outputs calculated from each kernel separately and obtain better results than MKL algorithms with SILP and SD approaches on image recognition problems.

Girolami & Rogers (2005) present Bayesian MKL algorithms for regression and binary classification using hierarchical models. Damoulas & Girolami (2008) give a multiclass MKL formulation using a very similar hierarchical model. The combined kernel in these two studies is defined as a convex sum of the input kernels using a Dirichlet prior on the kernel weights. As a consequence of the nonconjugacy between Dirichlet and normal distributions, they choose to use an importance sampling scheme to update the kernel weights when deriving variational approximations. Recently, Zhang et al. (2011) propose a fully Bayesian inference methodology for extending generalized linear models to kernelized models using a Markov chain Monte Carlo (MCMC) approach. The main issue with these approaches is that they depend on some sampling strategy and may not be trained in a reasonable time when the number of kernels is large.

Girolami & Zhong (2007) formulate a Gaussian process (GP) variant that uses multiple covariances (i.e., kernels) for multiclass classification using a variational approximation or expectation propagation scheme, which requires an MCMC sub-sampler for the covariance weights. Titsias & Lázaro-Gredilla (2011) propose a multitask GP model that combines a common set of GP functions (i.e., information sharing between the tasks) defined over multiple covariances with task-dependent weights whose sparsity is tuned using the spike and slab prior. A variational approximation approach is derived for an efficient inference scheme.

Our main motivation for this work is to formulate an efficient Bayesian inference approach without resorting to expensive sampling procedures.
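As a companion to the weighted-sum combination above, the following sketch (again ours, not from the paper) evaluates the combined decision function from precomputed kernel columns; the function and variable names are illustrative.

```python
import numpy as np

def combined_decision(a, b, e, k_star_list):
    """f(x_star) = a^T (sum_m e_m * k_{m,star}) + b, the weighted-sum combination.

    k_star_list: list of P arrays, each of shape (N,), holding the kernel
    values between the N training points and the test point.
    """
    k_e = sum(e_m * k_m for e_m, k_m in zip(e, k_star_list))
    return a @ k_e + b

# toy usage: P = 3 precomputed kernel columns for N = 4 training points
rng = np.random.default_rng(0)
k_star_list = [rng.random(4) for _ in range(3)]
a, b, e = rng.normal(size=4), 0.0, np.array([0.5, 0.3, 0.2])
print(combined_decision(a, b, e, k_star_list))
```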

3. Bayesian Efficient Multiple Kernel Learning

In order to obtain an efficient Bayesian MKL algorithm, we formulate a fully conjugate probabilistic model and develop a deterministic variational approximation mechanism for inference. We give the details for binary classification, but the same model can easily be extended to regression.

Figure 1 illustrates the proposed probabilistic model for binary classification as a graphical model. The main idea is to calculate intermediate outputs from each kernel using the same set of weight parameters and to combine these outputs using the kernel weights and the bias to estimate the classification score. Different from earlier probabilistic MKL approaches such as Girolami & Rogers (2005) and Damoulas & Girolami (2008), our method has two key properties that enable us to perform efficient inference: (a) Intermediate outputs calculated using the input kernels are introduced. (b) Kernel weights are assumed to be normally distributed without enforcing any constraints on them.

[Figure 1. Bayesian efficient MKL for binary classification: graphical model with nodes Km, λ, a, G, ω, e, γ, b, f, and y, where the kernel-specific variables are replicated over a plate of size P.]

The notation we use throughout the rest of the manuscript is as follows: N and P represent the numbers of training instances and input kernels, respectively. The N × N kernel matrices are denoted by Km, where the columns of Km are denoted by k_{m,i} and the rows of Km by k_m^i. The N × 1 vectors of weight parameters ai and their priors λi are denoted by a and λ, respectively. The P × N matrix of intermediate outputs g_i^m is represented as G, where the columns of G are denoted by gi and the rows of G by g^m. The bias parameter and its prior are denoted by b and γ, respectively. The P × 1 vectors of kernel weights em and their priors ωm are denoted by e and ω, respectively. The N × 1 vector of auxiliary variables fi is represented as f. The N × 1 vector of associated class labels is represented as y, where each element yi ∈ {−1, +1}.

As short-hand notations, all priors in the model are denoted by Ξ = {γ, λ, ω}, the remaining variables by Θ = {a, b, e, f, G}, and the hyper-parameters by ζ = {α_γ, β_γ, α_λ, β_λ, α_ω, β_ω}. Dependence on ζ is omitted for clarity throughout the manuscript.

The distributional assumptions of our proposed model are defined as

\[
\begin{aligned}
\lambda_i &\sim \mathcal{G}(\lambda_i; \alpha_\lambda, \beta_\lambda) && \forall i \\
a_i \mid \lambda_i &\sim \mathcal{N}(a_i; 0, \lambda_i^{-1}) && \forall i \\
g_i^m \mid a, k_{m,i} &\sim \mathcal{N}(g_i^m; a^\top k_{m,i}, 1) && \forall (m, i) \\
\gamma &\sim \mathcal{G}(\gamma; \alpha_\gamma, \beta_\gamma) && \\
b \mid \gamma &\sim \mathcal{N}(b; 0, \gamma^{-1}) && \\
\omega_m &\sim \mathcal{G}(\omega_m; \alpha_\omega, \beta_\omega) && \forall m \\
e_m \mid \omega_m &\sim \mathcal{N}(e_m; 0, \omega_m^{-1}) && \forall m \\
f_i \mid b, e, g_i &\sim \mathcal{N}(f_i; e^\top g_i + b, 1) && \forall i \\
y_i \mid f_i &\sim \delta(f_i y_i > \nu) && \forall i
\end{aligned}
\]

where the auxiliary variables between the intermediate outputs and the class labels are introduced to make the inference procedures efficient (Albert & Chib, 1993), and the margin parameter ν is introduced to resolve the scaling ambiguity issue and to place a low-density region between the two classes, similar to the margin idea in SVMs, which is generally used for semi-supervised learning (Lawrence & Jordan, 2005). N(·; µ, Σ) represents the normal distribution with the mean vector µ and the covariance matrix Σ. G(·; α, β) denotes the gamma distribution with the shape parameter α and the scale parameter β. δ(·) represents the Kronecker delta function that returns 1 if its argument is true and 0 otherwise.

When we consider the random variables in our model as deterministic values, the auxiliary variable that corresponds to the decision function value in discriminative methods can be decomposed as

\[
f_\star = e^\top g_\star + b = \sum_{m=1}^{P} e_m g_\star^m + b = \sum_{m=1}^{P} e_m a^\top k_{m,\star} + b = a^\top \left( \sum_{m=1}^{P} e_m k_{m,\star} \right) + b = a^\top k_{e,\star} + b
\]

where we see that the combined kernel in our model can be expressed as a linear sum of the input kernels.

Note that sample-level sparsity can be tuned by assigning suitable values to the hyper-parameters (α_λ, β_λ) as in RVMs (Tipping, 2001). Kernel-level sparsity can also be tuned by changing the hyper-parameters (α_ω, β_ω). Sparsity-inducing gamma priors can simulate the ℓ1-norm on the kernel weights, whereas uninformative priors simulate the ℓ2-norm.
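The generative process above can be illustrated with a short ancestral-sampling sketch. This is our own illustration under the stated distributional assumptions, not code from the paper; NumPy's gamma is parameterized by shape and scale, matching the (α, β) convention used here, and points whose auxiliary variable falls inside the margin are simply marked with label 0.

```python
import numpy as np

def sample_bemkl(K, alpha=1.0, beta=1.0, nu=1.0, seed=0):
    """Ancestral sampling from the BEMKL generative model.

    K: array of shape (P, N, N) holding precomputed (symmetric) kernel matrices.
    All gamma hyper-parameters are shared here only for brevity.
    """
    rng = np.random.default_rng(seed)
    P, N, _ = K.shape
    lam = rng.gamma(alpha, beta, size=N)             # precisions of a
    a = rng.normal(0.0, 1.0 / np.sqrt(lam))          # sample-level weights
    G = np.stack([K[m] @ a for m in range(P)])       # means a^T k_{m,i}
    G += rng.normal(size=(P, N))                     # unit-variance noise
    gamma = rng.gamma(alpha, beta)
    b = rng.normal(0.0, 1.0 / np.sqrt(gamma))        # bias
    omega = rng.gamma(alpha, beta, size=P)
    e = rng.normal(0.0, 1.0 / np.sqrt(omega))        # kernel weights
    f = e @ G + b + rng.normal(size=N)               # auxiliary variables
    # delta(f_i y_i > nu): the label is the sign of f_i when |f_i| > nu;
    # values inside the margin receive no label (marked 0 here).
    y = np.where(f > nu, 1, np.where(f < -nu, -1, 0))
    return a, e, b, f, y

K = np.stack([np.eye(4) for _ in range(3)])          # toy kernels: P = 3, N = 4
print(sample_bemkl(K)[-1])
```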

4. Efficient Inference Using Variational Approximation

Exact inference for our probabilistic model is intractable and using a Gibbs sampling approach is computationally expensive (Gelfand & Smith, 1990). We instead formulate a deterministic variational approximation, which is more efficient in terms of computation time. Variational methods find the joint parameter distribution by maximizing a lower bound on the marginal likelihood over an ensemble of factored posteriors (Beal, 2003). We can write the factorable ensemble approximation of the required posterior as

\[
p(\Theta, \Xi \mid \{K_m\}_{m=1}^{P}, y) \approx q(\Theta, \Xi) = q(\lambda)\, q(a)\, q(G)\, q(\gamma)\, q(\omega)\, q(b, e)\, q(f)
\]

and define each factor in the ensemble just like its full conditional distribution:

\[
\begin{aligned}
q(\lambda) &= \prod_{i=1}^{N} \mathcal{G}(\lambda_i; \alpha(\lambda_i), \beta(\lambda_i)) \\
q(a) &= \mathcal{N}(a; \mu(a), \Sigma(a)) \\
q(G) &= \prod_{i=1}^{N} \mathcal{N}(g_i; \mu(g_i), \Sigma(g_i)) \\
q(\gamma) &= \mathcal{G}(\gamma; \alpha(\gamma), \beta(\gamma)) \\
q(\omega) &= \prod_{m=1}^{P} \mathcal{G}(\omega_m; \alpha(\omega_m), \beta(\omega_m)) \\
q(b, e) &= \mathcal{N}\!\left( \begin{bmatrix} b \\ e \end{bmatrix}; \mu(b, e), \Sigma(b, e) \right) \\
q(f) &= \prod_{i=1}^{N} \mathcal{TN}(f_i; \mu(f_i), \Sigma(f_i), \rho(f_i))
\end{aligned}
\]

where α(·), β(·), µ(·), and Σ(·) denote the shape parameter, the scale parameter, the mean vector, and the covariance matrix for their arguments, respectively. TN(·; µ, Σ, ρ(·)) denotes the truncated normal distribution with the mean vector µ, the covariance matrix Σ, and the truncation rule ρ(·) such that TN(·; µ, Σ, ρ(·)) ∝ N(·; µ, Σ) if ρ(·) is true and TN(·; µ, Σ, ρ(·)) = 0 otherwise.

We can bound the marginal likelihood using Jensen's inequality:

\[
\log p(y \mid \{K_m\}_{m=1}^{P}) \geq \mathbb{E}_{q(\Theta, \Xi)}[\log p(y, \Theta, \Xi \mid \{K_m\}_{m=1}^{P})] - \mathbb{E}_{q(\Theta, \Xi)}[\log q(\Theta, \Xi)]
\tag{2}
\]

and optimize this bound by maximizing with respect to each factor separately until convergence. The approximate posterior distribution of a specific factor τ can be found as

\[
q(\tau) \propto \exp\!\left( \mathbb{E}_{q(\{\Theta, \Xi\} \setminus \tau)}[\log p(y, \Theta, \Xi \mid \{K_m\}_{m=1}^{P})] \right).
\]

For our model, thanks to the conjugacy, the resulting approximate posterior distribution of each factor follows the same distribution as the corresponding factor.

4.1. Inference Details

The approximate posterior distributions of the precision priors can be found as

\[
\begin{aligned}
q(\lambda) &= \prod_{i=1}^{N} \mathcal{G}\!\left( \lambda_i;\; \alpha_\lambda + \frac{1}{2},\; \left( \frac{1}{\beta_\lambda} + \frac{\widetilde{a_i^2}}{2} \right)^{-1} \right) \\
q(\gamma) &= \mathcal{G}\!\left( \gamma;\; \alpha_\gamma + \frac{1}{2},\; \left( \frac{1}{\beta_\gamma} + \frac{\widetilde{b^2}}{2} \right)^{-1} \right) \\
q(\omega) &= \prod_{m=1}^{P} \mathcal{G}\!\left( \omega_m;\; \alpha_\omega + \frac{1}{2},\; \left( \frac{1}{\beta_\omega} + \frac{\widetilde{e_m^2}}{2} \right)^{-1} \right)
\end{aligned}
\]

where the tilde notation denotes the posterior expectations as usual, i.e., \widetilde{h(\tau)} = \mathbb{E}_{q(\tau)}[h(\tau)].

The approximate posterior distribution of the weight parameters is a multivariate normal distribution:

\[
q(a) = \mathcal{N}\!\left( a;\; \Sigma(a) \sum_{m=1}^{P} K_m (\widetilde{g^m})^\top,\; \left( \operatorname{diag}(\widetilde{\lambda}) + \sum_{m=1}^{P} K_m K_m^\top \right)^{-1} \right).
\tag{3}
\]

The approximate posterior distribution of the intermediate outputs can be found as a product of multivariate normal distributions:

\[
q(G) = \prod_{i=1}^{N} \mathcal{N}\!\left( g_i;\; \Sigma(g_i) \left( \begin{bmatrix} k_{1,i}^\top \widetilde{a} \\ \vdots \\ k_{P,i}^\top \widetilde{a} \end{bmatrix} + \widetilde{f_i}\, \widetilde{e} - \widetilde{b}\, \widetilde{e} \right),\; \left( \mathbf{I} + \widetilde{e e^\top} \right)^{-1} \right).
\]

The approximate posterior distribution of the bias and the kernel weights can be formulated as a multivariate normal distribution:

\[
q(b, e) = \mathcal{N}\!\left( \begin{bmatrix} b \\ e \end{bmatrix};\; \Sigma(b, e) \begin{bmatrix} \mathbf{1}^\top \widetilde{f} \\ \widetilde{G} \widetilde{f} \end{bmatrix},\; \begin{bmatrix} \widetilde{\gamma} + N & \mathbf{1}^\top \widetilde{G}^\top \\ \widetilde{G} \mathbf{1} & \operatorname{diag}(\widetilde{\omega}) + \widetilde{G G^\top} \end{bmatrix}^{-1} \right)
\tag{4}
\]

where we allow kernel weights to take negative values. If the kernel weights are restricted to be nonnegative, the approximation becomes a truncated multivariate normal distribution restricted to the nonnegative orthant except for the first dimension, which is used for the bias.
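To show how the updates in (3) and (4) translate into matrix computations, the snippet below implements the q(a) and q(b, e) updates with NumPy. It is our own sketch, not the paper's implementation: the arguments ending in _tilde stand for the posterior expectations supplied by the other factors (in particular, GGT_tilde must be the posterior expectation of GG⊤, not just the product of the means), and the remaining updates are omitted.

```python
import numpy as np

def update_q_a(K, lambda_tilde, G_tilde):
    """Eq. (3): posterior mean and covariance of the weight vector a.

    K: (P, N, N) kernel matrices, G_tilde: (P, N) expected intermediate outputs.
    """
    KKT = sum(K[m] @ K[m].T for m in range(K.shape[0]))     # can be cached once
    Sigma_a = np.linalg.inv(np.diag(lambda_tilde) + KKT)     # N x N inverse
    mu_a = Sigma_a @ sum(K[m] @ G_tilde[m] for m in range(K.shape[0]))
    return mu_a, Sigma_a

def update_q_be(gamma_tilde, omega_tilde, G_tilde, GGT_tilde, f_tilde):
    """Eq. (4): posterior mean and covariance of (b, e)."""
    P, N = G_tilde.shape
    precision = np.zeros((P + 1, P + 1))
    precision[0, 0] = gamma_tilde + N
    precision[0, 1:] = G_tilde.sum(axis=1)                   # 1^T G~^T
    precision[1:, 0] = G_tilde.sum(axis=1)                   # G~ 1
    precision[1:, 1:] = np.diag(omega_tilde) + GGT_tilde     # expectation of G G^T
    Sigma_be = np.linalg.inv(precision)                      # (P+1) x (P+1) inverse
    mu_be = Sigma_be @ np.concatenate(([f_tilde.sum()], G_tilde @ f_tilde))
    return mu_be, Sigma_be

# toy shapes: P = 2 kernels, N = 3 points
rng = np.random.default_rng(0)
K = np.stack([np.eye(3), np.ones((3, 3))])
mu_a, Sigma_a = update_q_a(K, np.ones(3), rng.normal(size=(2, 3)))
```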
The approximate posterior distribution of the auxiliary variables is a product of truncated normal distributions given as

\[
q(f) = \prod_{i=1}^{N} \mathcal{TN}(f_i;\; \widetilde{e}^\top \widetilde{g_i} + \widetilde{b},\; 1,\; f_i y_i > \nu)
\]

where we need to find their posterior expectations to update the approximate posterior distributions of the intermediate outputs, the bias, and the kernel weights. Fortunately, the truncated normal distribution has a closed-form formula for its expectation.

The sum ∑_{m=1}^{P} Km Km⊤ in (3) should be cached before starting inference to reduce the computational complexity. (3) requires inverting an N × N matrix for the covariance calculation, whereas (4) requires inverting a (P + 1) × (P + 1) matrix. One of these two updates dominates the running time depending on whether N > P.
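The closed-form expectation mentioned above follows from standard truncated-normal identities. The helper below is our own sketch for the one-dimensional truncation rule f_i y_i > ν with unit variance, using SciPy's normal pdf and cdf; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def update_q_f(mu, y, nu=1.0):
    """Posterior mean and normalizer of q(f_i) = TN(f_i; mu_i, 1, f_i y_i > nu).

    mu: prior means e~^T g~_i + b~, y: labels in {-1, +1}.
    Standard truncated-normal mean for unit variance:
    E[f_i] = mu_i + y_i * phi(z_i) / Phi(z_i) with z_i = y_i * mu_i - nu.
    """
    z = y * mu - nu
    Z = norm.cdf(z)                        # normalization coefficients
    f_tilde = mu + y * norm.pdf(z) / Z     # posterior expectations
    return f_tilde, Z

f_tilde, Z = update_q_f(np.array([0.3, -1.2]), np.array([1, -1]))
```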

4.2. Convergence

The inference mechanism sequentially updates the approximate posterior distributions of the model parameters and the latent variables until convergence, which can be checked by monitoring the lower bound in (2). The first term of the lower bound corresponds to the sum of exponential-form expectations of the distributions in the joint likelihood. The second term is the sum of negative entropies of the approximate posteriors in the ensemble. The only nonstandard distribution in these terms is the truncated normal distribution used for the auxiliary variables; nevertheless, the truncated normal distribution has a closed-form formula also for its entropy. The exact form of the variational lower bound can be found in the supplementary material available at http://users.ics.aalto.fi/gonen/bemkl/.

4.3. Prediction

We can replace p(a | {Km}_{m=1}^{P}, y) with its approximate posterior distribution q(a) and obtain the predictive distribution of the intermediate outputs g⋆ for a new data point as

\[
p(g_\star \mid \{k_{m,\star}, K_m\}_{m=1}^{P}, y) = \prod_{m=1}^{P} \mathcal{N}\!\left( g_\star^m;\; \mu(a)^\top k_{m,\star},\; 1 + k_{m,\star}^\top \Sigma(a)\, k_{m,\star} \right).
\]

The predictive distribution of the auxiliary variable f⋆ can also be found by replacing p(b, e | {Km}_{m=1}^{P}, y) with its approximate posterior distribution q(b, e):

\[
p(f_\star \mid g_\star, \{K_m\}_{m=1}^{P}, y) = \mathcal{N}\!\left( f_\star;\; \mu(b, e)^\top \begin{bmatrix} 1 \\ g_\star \end{bmatrix},\; 1 + \begin{bmatrix} 1 \\ g_\star \end{bmatrix}^\top \Sigma(b, e) \begin{bmatrix} 1 \\ g_\star \end{bmatrix} \right)
\]

and the predictive distribution of the class label y⋆ can be formulated using the auxiliary variable distribution:

\[
p(y_\star = +1 \mid \{k_{m,\star}, K_m\}_{m=1}^{P}, y) = Z_\star^{-1}\, \Phi\!\left( \frac{\mu(f_\star) - \nu}{\Sigma(f_\star)} \right)
\]

where Z⋆ is the normalization coefficient calculated for the test data point and Φ(·) is the standardized normal cumulative distribution function.
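A minimal sketch of this prediction pipeline is given below. It is our own illustration: mu_a, Sigma_a, mu_be, and Sigma_be stand for the posterior quantities obtained during training, the predictive means of g⋆ are plugged in, the class probability is normalized over the two labels (one plausible reading of the coefficient Z⋆), and dividing by the predictive standard deviation is our choice of parameterization.

```python
import numpy as np
from scipy.stats import norm

def predict_proba(k_star, mu_a, Sigma_a, mu_be, Sigma_be, nu=1.0):
    """P(y_star = +1) for one test point from the predictive distributions.

    k_star: (P, N) kernel values between the test point and training points.
    """
    g_star = k_star @ mu_a                           # predictive means of g_star
    x = np.concatenate(([1.0], g_star))              # [1; g_star]
    mu_f = mu_be @ x                                 # mean of f_star
    var_f = 1.0 + x @ Sigma_be @ x                   # variance of f_star
    pos = norm.cdf((mu_f - nu) / np.sqrt(var_f))     # mass above the +nu margin
    neg = norm.cdf((-mu_f - nu) / np.sqrt(var_f))    # mass below the -nu margin
    return pos / (pos + neg)                         # normalized over the two labels
```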
5. Extensions

The proposed MKL method can be extended in several directions. We shortly explain two variants for multiclass learning and semi-supervised learning.

5.1. Multiclass Learning

In multiclass learning, we consider classification problems with more than two classes. The easiest way is to train a distinct classifier for each class that separates this particular class from the remaining ones (i.e., one-versus-all classification). In this setup, for each classifier, we learn a different combined kernel function, which corresponds to using a different similarity measure. Instead, we can have different classifiers but use the same set of kernel weights for each of them (Rakotomamonjy et al., 2008); a schematic wrapper is sketched at the end of this section. The inference details of our model with such an approach can be found in the supplementary material. Another possibility is to use the multinomial probit in our model for multiclass classification as done by Damoulas & Girolami (2008).

5.2. Semi-Supervised Learning

In the kernel-based discriminative learning framework, semi-supervised learning can be formulated as an integer programming problem, known as transductive SVMs (Vapnik, 1998). Even though there are some approximation methods, the computational complexity of such models is significantly higher than that of fully supervised models. Considering this complexity issue, it may not be feasible to use discriminative MKL algorithms for semi-supervised learning. However, our proposed model can be modified to make use of unlabeled data with a slight increase in computational complexity using the low-density assumption (Lawrence & Jordan, 2005).
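To illustrate the one-versus-all decomposition of Section 5.1, here is a schematic wrapper; bemkl_train and bemkl_predict are hypothetical stand-ins for a binary BEMKL implementation, and the shared-kernel-weights variant described above would instead tie a single set of kernel weights across all classes inside the inference itself.

```python
import numpy as np

def one_versus_all_train(K, y, classes, bemkl_train):
    """Train one binary model per class (bemkl_train is a hypothetical helper)."""
    return {c: bemkl_train(K, np.where(y == c, 1, -1)) for c in classes}

def one_versus_all_predict(K_star, models, bemkl_predict):
    """Assign each test point to the class with the largest probability."""
    classes = list(models)
    scores = np.column_stack([bemkl_predict(K_star, models[c]) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```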
6. Experiments

We first test our new algorithm BEMKL on benchmark data sets to show its computational efficiency. We then illustrate its generalization performance by comparing it with previously reported MKL results on one bioinformatics and three image recognition data sets. We implement the proposed variational approximation for BEMKL in Matlab and our implementation is available at http://users.ics.aalto.fi/gonen/bemkl/.

6.1. Experiments on Benchmark Data Sets

Our first set of experiments is designed to evaluate the running times of BEMKL. The experiments are performed on eight data sets from the UCI repository: breast, bupaliver, heart, ionosphere, pima, sonar, wdbc, and wpbc. For each data set, we take 20 replications, where we randomly select 70 per cent of the data set as the training set and use the remaining part as the test set. The training set is normalized to have zero mean and unit standard deviation, and the test set is then normalized using the mean and the standard deviation of the training set.

We construct Gaussian kernels with 10 different widths ({2^{-3}, 2^{-2}, ..., 2^{6}}) and polynomial kernels with three different degrees ({1, 2, 3}) on all features and on each single feature, following the experimental settings of Rakotomamonjy et al. (2008), Xu et al. (2009; 2010), and Vishwanathan et al. (2010). All kernel matrices are normalized to have unit diagonal entries (i.e., spherical normalization) and precomputed before running the inference algorithm (a code sketch of this construction appears at the end of this subsection). The experiments are performed on a PC with a 3.0GHz CPU and 4GB memory.

We run BEMKL for two different sparsity levels on the kernel weights: sparse and non-sparse. The hyper-parameter values for these scenarios are selected as (α_λ, β_λ, α_γ, β_γ, α_ω, β_ω) = (1, 1, 1, 1, 10^{-10}, 10^{+10}) and (α_λ, β_λ, α_γ, β_γ, α_ω, β_ω) = (1, 1, 1, 1, 1, 1), respectively. We take 200 iterations for the variational inference scheme.

Table 1. Experiments on benchmark data sets. The figures are averages and standard deviations over 20 replications. Here, N and P denote the numbers of training instances and input kernels, respectively.

Data set (N, P)         Measure                 sparse          non-sparse
breast (478, 130)       Training Time (sec)     18.86±0.24      18.84±0.43
                        Test Accuracy (%)       96.80±1.02      96.98±0.86
                        Selected Kernel (#)     34.35±2.46      98.95±1.32
bupaliver (241, 91)     Training Time (sec)      3.83±0.03       3.81±0.04
                        Test Accuracy (%)       68.41±4.28      70.72±3.83
                        Selected Kernel (#)     18.40±2.62      70.70±1.13
heart (189, 377)        Training Time (sec)     13.69±0.08      13.58±0.09
                        Test Accuracy (%)       80.80±4.01      81.36±4.70
                        Selected Kernel (#)     41.85±4.26     181.40±8.93
ionosphere (245, 442)   Training Time (sec)     22.72±0.05      22.76±0.12
                        Test Accuracy (%)       92.03±1.72      92.03±1.77
                        Selected Kernel (#)     41.90±4.42     219.05±7.77
pima (537, 117)         Training Time (sec)     21.15±0.23      20.94±0.22
                        Test Accuracy (%)       75.02±2.28      74.96±2.08
                        Selected Kernel (#)     23.20±2.02      79.55±2.93
sonar (144, 793)        Training Time (sec)     51.34±0.19      51.51±0.10
                        Test Accuracy (%)       76.88±3.95      82.81±4.12
                        Selected Kernel (#)     15.30±3.40     372.80±7.78
wdbc (398, 403)         Training Time (sec)     41.31±0.21      41.44±0.16
                        Test Accuracy (%)       95.70±2.32      95.76±1.65
                        Selected Kernel (#)     35.65±2.28     215.50±4.81
wpbc (135, 429)         Training Time (sec)     12.72±0.11      12.67±0.23
                        Test Accuracy (%)       75.25±1.01      73.64±2.66
                        Selected Kernel (#)     28.25±2.75     220.10±9.83

Table 1 lists the results obtained by BEMKL with the two scenarios on the benchmark data sets in terms of three different measures: the training time in seconds, the test accuracy, and the number of selected kernels. We see that our deterministic variational approximation for BEMKL takes less than a minute with large numbers of kernels, ranging from 91 to 793. The classification accuracy difference between sparse and non-sparse kernel combination is quite obvious for some data sets such as sonar, in accordance with the previous studies. Using sparsity-inducing priors on the kernel weights clearly simulates the ℓ1-norm by eliminating most of the input kernels, whereas using uninformative priors picks many more kernels by simulating the ℓ2-norm.
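The kernel construction and spherical normalization described in Section 6.1 can be sketched as follows. This is our own illustration of the stated protocol: the exact parameterizations of the Gaussian width and of the polynomial kernel are assumptions, while the width and degree grids are the ones listed in the text.

```python
import numpy as np

def gaussian_kernel(X, width):
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * width ** 2))      # width parameterization assumed

def polynomial_kernel(X, degree):
    return (X @ X.T + 1.0) ** degree             # kernel form assumed

def spherical_normalization(K):
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)                    # unit diagonal entries

def build_kernels(X):
    """Gaussian kernels with widths 2^-3..2^6 and polynomial kernels of degree
    1-3, on all features and on each single feature, all normalized."""
    kernels = []
    feature_sets = [np.arange(X.shape[1])] + [np.array([j]) for j in range(X.shape[1])]
    for cols in feature_sets:
        for width in 2.0 ** np.arange(-3, 7):
            kernels.append(spherical_normalization(gaussian_kernel(X[:, cols], width)))
        for degree in (1, 2, 3):
            kernels.append(spherical_normalization(polynomial_kernel(X[:, cols], degree)))
    return np.stack(kernels)
```

With D input features this yields 13(D + 1) kernels, which matches the kernel counts reported in Table 1 (e.g., 130 kernels for breast with D = 9).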

6.2. Comparison on MKL Data Sets

We use four data sets that have previously been used to compare MKL algorithms and have kernel matrices available for direct evaluation. Note that the results used for comparison may not be the best results reported on these data sets, but we use the exact same kernel matrices that produce these results for our algorithm to have comparable performance measures. These data sets have small numbers of kernels available and we do not force any sparsity on the kernel weights, using an uninformative gamma prior in accordance with the previous studies. All data sets consider multiclass classification problems and we report both one-versus-all and multiclass results for BEMKL.

6.2.1. Protein Fold Recognition Data Set

Protein data set considers 27 different fold types of 694 proteins (311 for training and 383 for testing) and it is available at http://mkl.ucsd.edu/dataset/protein-fold-prediction/. There are 12 distinct feature representations summarizing protein characteristics. We construct 12 kernels on these representations and take 20 replications using the experimental procedure of Damoulas & Girolami (2008). Table 2 gives the classification results on Protein data set. We see that BEMKL is significantly better than the probabilistic MKL method of Damoulas & Girolami (2008).

Table 2. Performance comparison on Protein data set.

Method                          Test Accuracy
Damoulas & Girolami (2008)      68.1±1.2
BEMKL (one-versus-all)          71.5±0.1
BEMKL (multiclass)              71.2±0.2

6.2.2. Oxford Flowers17 Data Set

Flowers17 data set contains flower images from 17 different types with 80 images per class and it is available at http://www.robots.ox.ac.uk/~vgg/data/flowers/17/. It also provides three predefined splits with 60 images for training and 20 images for testing from each class. There are seven precomputed distance matrices over different feature representations. These matrices are converted into kernels as k(xi, xj) = exp(−d(xi, xj)/s), where s is the mean distance between training point pairs. The classification results on Flowers17 data set are shown in Table 3. BEMKL achieves higher average test accuracy with smaller standard deviation across splits than the boosting-type MKL algorithm of Gehler & Nowozin (2009). We also see that using the same set of kernel weights for each class, as in our multiclass formulation, is better than learning separate sets of kernel weights, as also observed by Gehler & Nowozin (2009).

Table 3. Performance comparison on Flowers17 data set.

Method                          Test Accuracy
Gehler & Nowozin (2009)         85.5±3.0
BEMKL (one-versus-all)          85.6±1.2
BEMKL (multiclass)              85.9±1.2
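The distance-to-kernel conversion used above (and reused for Flowers102) is simple enough to state as code. This is our own sketch of k(xi, xj) = exp(−d(xi, xj)/s) with s the mean distance between training point pairs; whether the diagonal is excluded when computing s is an assumption.

```python
import numpy as np

def distances_to_kernel(D, train_idx):
    """Convert a precomputed distance matrix D into a kernel matrix.

    s is the mean distance between training point pairs (off-diagonal entries;
    including the diagonal is an implementation detail left unspecified here).
    """
    D_train = D[np.ix_(train_idx, train_idx)]
    mask = ~np.eye(len(train_idx), dtype=bool)
    s = D_train[mask].mean()
    return np.exp(-D / s)
```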

6.2.3. Oxford Flowers102 Data Set

Flowers102 data set contains flower images from 102 different types with more than 40 images per class and it is available at http://www.robots.ox.ac.uk/~vgg/data/flowers/102/. There is a predefined split consisting of 2040 training and 6149 testing images. There are four precomputed distance matrices over different feature representations. These matrices are converted into kernels using the same procedure as on Flowers17 data set. Table 4 shows the classification results on Flowers102 data set. We report averages of area under the ROC curve (AUC) and equal error rate (EER) values calculated for each class in addition to multiclass accuracy. We see that BEMKL outperforms the GP-based method of Titsias & Lázaro-Gredilla (2011) in all metrics on this challenging task.

Table 4. Performance comparison on Flowers102 data set.

Method                                AUC      EER      Accuracy
Titsias & Lázaro-Gredilla (2011)      0.952    0.107    40.0
BEMKL (one-versus-all)                0.969    0.068    67.0
BEMKL (multiclass)                    0.969    0.069    68.9

6.2.4. Caltech101 Data Set

Caltech101 data set consists of object images from 102 different categories and it is available at http://www.robots.ox.ac.uk/~vgg/software/MKL/. There are three predefined splits with 15 images for training and 15 images for testing from each class, along with 10 precomputed kernel matrices. The classification results on Caltech101 data set are given in Table 5. BEMKL achieves higher average test accuracy with smaller standard deviation across splits than the discriminative MKL algorithm of Varma & Babu (2009). Similar to the results on Flowers17 and Flowers102, we see that using the same set of kernel weights for each class instead of separate sets is better in terms of generalization performance. This approach allows the classifiers trained for each class to share information with the others, similar to multitask learning.

Table 5. Performance comparison on Caltech101 data set.

Method                          Test Accuracy
Varma & Babu (2009)             71.1±0.6
BEMKL (one-versus-all)          72.7±0.1
BEMKL (multiclass)              73.1±0.1

7. Conclusions

In this paper, we introduce a Bayesian multiple kernel learning framework in order to have computationally feasible algorithms by formulating the combination in a novel way. This formulation allows us to develop fully conjugate probabilistic models and to derive very efficient deterministic variational approximations for inference. We give the inference details for binary classification only due to the space limitation and explain briefly how the method can be extended to multiclass learning and semi-supervised learning. Another interesting direction is to formulate a multitask learning method by enforcing similarity between the kernel weights of different tasks.

Experimental results on benchmark binary classification data sets show that the proposed method can combine hundreds of kernels in less than a minute. We also report very good results on one bioinformatics and three image recognition data sets, which contain multiclass classification problems, compared to previously reported results.

Acknowledgments. This work was financially supported by the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, grant no 251170).

References

Albert, J. H. and Chib, S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679, 1993.

Bach, F. R., Lanckriet, G. R. G., and Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, 2004.

Beal, M. J. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, The Gatsby Computational Neuroscience Unit, University College London, 2003.

Cortes, C., Mohri, M., and Rostamizadeh, A. L2 regularization for learning kernels. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

Damoulas, T. and Girolami, M. A. Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection. Bioinformatics, 24(10):1264–1270, 2008.

Gehler, P. and Nowozin, S. On feature combination for multiclass object classification. In Proceedings of the 12th International Conference on Computer Vision, 2009.

Gelfand, A. E. and Smith, A. F. M. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409, 1990.

Girolami, M. and Rogers, S. Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

Girolami, M. and Zhong, M. Data integration for classification problems employing Gaussian process priors. In Advances in Neural Information Processing Systems 19, 2007.

Gönen, M. and Alpaydın, E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.

Kloft, M., Brefeld, U., Sonnenburg, S., and Zien, A. ℓp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, 2011.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

Lawrence, N. D. and Jordan, M. I. Semi-supervised learning via Gaussian processes. In Advances in Neural Information Processing Systems 17, 2005.

Rakotomamonjy, A., Bach, F. R., Canu, S., and Grandvalet, Y. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

Sonnenburg, S., Rätsch, G., Schäfer, C., and Schölkopf, B. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

Titsias, M. K. and Lázaro-Gredilla, M. Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in Neural Information Processing Systems 24, 2011.

Vapnik, V. N. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

Varma, M. and Babu, B. R. More generality in efficient multiple kernel learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.

Vishwanathan, S. V. N., Sun, Z., Theera-Ampornpunt, N., and Varma, M. Multiple kernel learning and the SMO algorithm. In Advances in Neural Information Processing Systems 23, 2010.

Xu, Z., Jin, R., King, I., and Lyu, M. R. An extended level method for efficient multiple kernel learning. In Advances in Neural Information Processing Systems 21, 2009.

Xu, Z., Jin, R., Yang, H., King, I., and Lyu, M. R. Simple and efficient multiple kernel learning by group Lasso. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Zhang, Z., Dai, G., and Jordan, M. I. Bayesian generalized kernel mixed models. Journal of Machine Learning Research, 12:111–139, 2011.

Zien, A. and Ong, C. S. Multiclass multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.