Bayesian Efficient Multiple Kernel Learning

Mehmet Gönen [email protected]
Helsinki Institute for Information Technology HIIT
Department of Information and Computer Science, Aalto University School of Science

Abstract

Multiple kernel learning algorithms are proposed to combine kernels in order to obtain a better similarity measure or to integrate feature representations coming from different data sources. Most of the previous research on such methods is focused on the computational efficiency issue. However, it is still not feasible to combine many kernels using existing Bayesian approaches due to their high time complexity. We propose a fully conjugate Bayesian formulation and derive a deterministic variational approximation, which allows us to combine hundreds or thousands of kernels very efficiently. We briefly explain how the proposed method can be extended for multiclass learning and semi-supervised learning. Experiments with large numbers of kernels on benchmark data sets show that our inference method is quite fast, requiring less than a minute. On one bioinformatics and three image recognition data sets, our method outperforms previously reported results with better generalization performance.

1. Introduction

The main idea of kernel-based algorithms is to learn a linear decision function in the feature space where data points are implicitly mapped to using a kernel function (Vapnik, 1998). Given a sample of N independent and identically distributed training instances {xi ∈ X}_{i=1}^{N}, the decision function that is used to predict the target output of an unseen test instance x⋆ can be written as

\[
f(x_\star) = a^\top k_\star + b
\tag{1}
\]

where the vector of weights assigned to each training data point and the bias are denoted by a and b, respectively, and k⋆ = [k(x1, x⋆) ... k(xN, x⋆)]⊤, where k : X × X → R is the kernel function that calculates a similarity measure between two data points. Using the theory of structural risk minimization, the model parameters can be found by solving a quadratic programming problem, known as the support vector machine (SVM) (Vapnik, 1998). The model parameters can also be interpreted as random variables to obtain a Bayesian interpretation of the model, known as the relevance vector machine (RVM) (Tipping, 2001).

Kernel selection (i.e., choosing a functional form and its parameters) is the most important issue that affects the empirical performance of kernel-based algorithms and is usually done using a cross-validation procedure. Multiple kernel learning (MKL) methods have been proposed to make use of multiple kernels simultaneously instead of selecting a single kernel (see the recent survey by Gönen & Alpaydın (2011)). Such methods also provide a principled way of integrating feature representations coming from different data sources or modalities. Most of the previous research is focused on developing efficient MKL algorithms. Nevertheless, existing Bayesian MKL methods are problematic in terms of computation time when combining hundreds or thousands of kernels. In this paper, we formulate a very efficient Bayesian MKL method that solves this issue by formulating the combination in a novel way.

In Section 2, we give an overview of the related work by considering existing discriminative and Bayesian MKL algorithms. Section 3 gives the details of the proposed fully conjugate Bayesian formulation, called Bayesian efficient multiple kernel learning (BEMKL). In Section 4, we explain detailed derivations of our deterministic variational approximation for binary classification. Extensions towards multiclass learning and semi-supervised learning are summarized in Section 5. Section 6 evaluates BEMKL with large numbers of kernels on standard benchmark data sets in terms of time complexity, and reports classification results on one bioinformatics and three image recognition tasks, which are frequently used to compare MKL methods.
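To make the decision function in (1) concrete, here is a minimal sketch (ours, not part of the paper) that evaluates f(x⋆) = a⊤k⋆ + b for a single test point; the Gaussian kernel and all variable names are illustrative choices, and in practice a and b come from training.

```python
import numpy as np

def gaussian_kernel_vector(X_train, x_star, sigma=1.0):
    """Kernel values k_star[i] = k(x_i, x_star) for a Gaussian kernel."""
    sq_dists = np.sum((X_train - x_star) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def decision_function(a, b, X_train, x_star, sigma=1.0):
    """Evaluate f(x_star) = a^T k_star + b as in Eq. (1)."""
    k_star = gaussian_kernel_vector(X_train, x_star, sigma)
    return a @ k_star + b

# toy usage with random weights (in practice a and b come from training)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 3))
a, b = rng.normal(size=5), 0.1
print(decision_function(a, b, X_train, X_train[0]))
```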

2. Related Work

MKL algorithms basically replace the kernel in (1) with a combined kernel calculated as a function of the input kernels. The most common combination is to use a weighted sum of P kernels {km : X × X → R}_{m=1}^{P}:

\[
f(x_\star) = a^\top \underbrace{\left( \sum_{m=1}^{P} e_m k_{m,\star} \right)}_{k_{e,\star}} + b
\]

where the vector of kernel weights is denoted by e and k_{m,⋆} = [km(x1, x⋆) ... km(xN, x⋆)]⊤. Existing MKL algorithms with a weighted sum differ in the way that they formulate restrictions on the kernel weights: arbitrary weights (i.e., linear sum), nonnegative weights (i.e., conic sum), or weights on a simplex (i.e., convex sum).

Bach et al. (2004) formulate the problem as a second-order cone programming (SOCP) problem, which was formulated as a semidefinite programming problem previously by Lanckriet et al. (2004). However, SOCP can only be solved efficiently for medium-scale problems. Sonnenburg et al. (2006) reinterpret the problem as a semi-infinite linear programming (SILP) problem, which can be applied to large-scale data sets. Rakotomamonjy et al. (2008) develop a simple MKL algorithm using a sub-gradient descent (SD) approach, which is faster than the SILP method. Later, Xu et al. (2009) extend the level method, which is originally designed for optimizing non-smooth objective functions, to obtain a very efficient MKL algorithm that carries flavors from both SILP and SD approaches but outperforms them in terms of computation time.

The aforementioned methods tend to produce sparse kernel combinations, which corresponds to using the ℓ1-norm on the kernel weights. Sparsity at the kernel level may harm the generalization performance of the learner, and using non-sparse kernel combinations (e.g., the ℓ2-norm) may be a better choice (Cortes et al., 2009). Varma & Babu (2009) propose a generalized MKL algorithm that can use any differentiable and continuous regularization term on the kernel weights. This also allows us to integrate prior knowledge about the kernels into the model. Xu et al. (2010) and Kloft et al. (2011) independently and in parallel develop an MKL algorithm with the ℓp-norm (p ≥ 1) on the kernel weights. This method has a closed-form update rule for the kernel weights and requires only an SVM solver for optimization. The sequential minimal optimization (SMO) algorithm is the most commonly used method for solving SVM problems and efficiently scales to large problems. Vishwanathan et al. (2010) propose a very efficient method, called SMO-MKL, to train ℓp-norm (p > 1) MKL models with the squared norm as the regularization term, using the SMO algorithm for solving the MKL problem directly instead of solving intermediate SVMs at each iteration.

Most of the discriminative MKL algorithms are developed for binary classification. One-versus-all or one-versus-other strategies can be employed to get multiclass learners. However, there are also some direct formulations for multiclass learning. Zien & Ong (2007) give a multiclass MKL algorithm by formulating the problem as an SILP and show that their method is equivalent to multiclass generalizations of Bach et al. (2004) and Sonnenburg et al. (2006). Gehler & Nowozin (2009) propose a boosting-type MKL algorithm that combines outputs calculated from each kernel separately and obtain better results than MKL algorithms with SILP and SD approaches on image recognition problems.

Girolami & Rogers (2005) present Bayesian MKL algorithms for regression and binary classification using hierarchical models. Damoulas & Girolami (2008) give a multiclass MKL formulation using a very similar hierarchical model. The combined kernel in these two studies is defined as a convex sum of the input kernels using a Dirichlet prior on the kernel weights. As a consequence of the nonconjugacy between Dirichlet and normal distributions, they choose to use an importance sampling scheme to update the kernel weights when deriving variational approximations. Recently, Zhang et al. (2011) propose a fully Bayesian inference methodology for extending generalized linear models to kernelized models using a Markov chain Monte Carlo (MCMC) approach. The main issue with these approaches is that they depend on some sampling strategy and may not be trained in a reasonable time when the number of kernels is large.

Girolami & Zhong (2007) formulate a Gaussian process (GP) variant that uses multiple covariances (i.e., kernels) for multiclass classification using a variational approximation or expectation propagation scheme, which requires an MCMC sub-sampler for the covariance weights. Titsias & Lázaro-Gredilla (2011) propose a multitask GP model that combines a common set of GP functions (i.e., information sharing between the tasks) defined over multiple covariances with task-dependent weights whose sparsity is tuned using the spike and slab prior. A variational approximation approach is derived for an efficient inference scheme.

Our main motivation for this work is to formulate an efficient Bayesian inference approach without resorting to expensive sampling procedures.
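As a companion to the weighted-sum combination above, the following sketch (again ours, not from the paper) evaluates the combined decision function from precomputed kernel columns; the function and variable names are illustrative.

```python
import numpy as np

def combined_decision(a, b, e, k_star_list):
    """f(x_star) = a^T (sum_m e_m * k_{m,star}) + b, the weighted-sum combination.

    k_star_list: list of P arrays, each of shape (N,), holding the kernel
    values between the N training points and the test point.
    """
    k_e = sum(e_m * k_m for e_m, k_m in zip(e, k_star_list))
    return a @ k_e + b

# toy usage: P = 3 precomputed kernel columns for N = 4 training points
rng = np.random.default_rng(0)
k_star_list = [rng.random(4) for _ in range(3)]
a, b, e = rng.normal(size=4), 0.0, np.array([0.5, 0.3, 0.2])
print(combined_decision(a, b, e, k_star_list))
```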

3. Bayesian Efficient Multiple Kernel Learning

In order to obtain an efficient Bayesian MKL algorithm, we formulate a fully conjugate probabilistic model and develop a deterministic variational approximation mechanism for inference. We give the details for binary classification, but the same model can easily be extended to regression.

Figure 1 illustrates the proposed probabilistic model for binary classification as a graphical model. The main idea is to calculate intermediate outputs from each kernel using the same set of weight parameters and to combine these outputs using the kernel weights and the bias to estimate the classification score. Different from earlier probabilistic MKL approaches such as Girolami & Rogers (2005) and Damoulas & Girolami (2008), our method has two key properties that enable us to perform efficient inference: (a) Intermediate outputs calculated using the input kernels are introduced. (b) Kernel weights are assumed to be normally distributed without enforcing any constraints on them.

[Figure 1. Bayesian efficient MKL for binary classification: graphical model with nodes Km, λ, a, G, ω, e, γ, b, f, and y, where the kernel-specific variables are replicated over a plate of size P.]

The notation we use throughout the rest of the manuscript is as follows: N and P represent the numbers of training instances and input kernels, respectively. The N × N kernel matrices are denoted by Km, where the columns of Km are denoted by k_{m,i} and the rows of Km by k_m^i. The N × 1 vectors of weight parameters ai and their priors λi are denoted by a and λ, respectively. The P × N matrix of intermediate outputs g_i^m is represented as G, where the columns of G are denoted by gi and the rows of G by g^m. The bias parameter and its prior are denoted by b and γ, respectively. The P × 1 vectors of kernel weights em and their priors ωm are denoted by e and ω, respectively. The N × 1 vector of auxiliary variables fi is represented as f. The N × 1 vector of associated class labels is represented as y, where each element yi ∈ {−1, +1}.

As short-hand notations, all priors in the model are denoted by Ξ = {γ, λ, ω}, the remaining variables by Θ = {a, b, e, f, G}, and the hyper-parameters by ζ = {α_γ, β_γ, α_λ, β_λ, α_ω, β_ω}. Dependence on ζ is omitted for clarity throughout the manuscript.

The distributional assumptions of our proposed model are defined as

\[
\begin{aligned}
\lambda_i &\sim \mathcal{G}(\lambda_i; \alpha_\lambda, \beta_\lambda) && \forall i \\
a_i \mid \lambda_i &\sim \mathcal{N}(a_i; 0, \lambda_i^{-1}) && \forall i \\
g_i^m \mid a, k_{m,i} &\sim \mathcal{N}(g_i^m; a^\top k_{m,i}, 1) && \forall (m, i) \\
\gamma &\sim \mathcal{G}(\gamma; \alpha_\gamma, \beta_\gamma) && \\
b \mid \gamma &\sim \mathcal{N}(b; 0, \gamma^{-1}) && \\
\omega_m &\sim \mathcal{G}(\omega_m; \alpha_\omega, \beta_\omega) && \forall m \\
e_m \mid \omega_m &\sim \mathcal{N}(e_m; 0, \omega_m^{-1}) && \forall m \\
f_i \mid b, e, g_i &\sim \mathcal{N}(f_i; e^\top g_i + b, 1) && \forall i \\
y_i \mid f_i &\sim \delta(f_i y_i > \nu) && \forall i
\end{aligned}
\]

where the auxiliary variables between the intermediate outputs and the class labels are introduced to make the inference procedures efficient (Albert & Chib, 1993), and the margin parameter ν is introduced to resolve the scaling ambiguity issue and to place a low-density region between the two classes, similar to the margin idea in SVMs, which is generally used for semi-supervised learning (Lawrence & Jordan, 2005). N(·; µ, Σ) represents the normal distribution with the mean vector µ and the covariance matrix Σ. G(·; α, β) denotes the gamma distribution with the shape parameter α and the scale parameter β. δ(·) represents the Kronecker delta function that returns 1 if its argument is true and 0 otherwise.

When we consider the random variables in our model as deterministic values, the auxiliary variable that corresponds to the decision function value in discriminative methods can be decomposed as

\[
f_\star = e^\top g_\star + b = \sum_{m=1}^{P} e_m g_\star^m + b = \sum_{m=1}^{P} e_m a^\top k_{m,\star} + b = a^\top \left( \sum_{m=1}^{P} e_m k_{m,\star} \right) + b = a^\top k_{e,\star} + b
\]

where we see that the combined kernel in our model can be expressed as a linear sum of the input kernels.

Note that sample-level sparsity can be tuned by assigning suitable values to the hyper-parameters (α_λ, β_λ) as in RVMs (Tipping, 2001). Kernel-level sparsity can also be tuned by changing the hyper-parameters (α_ω, β_ω). Sparsity-inducing gamma priors can simulate the ℓ1-norm on the kernel weights, whereas uninformative priors simulate the ℓ2-norm.
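The generative process above can be illustrated with a short ancestral-sampling sketch. This is our own illustration under the stated distributional assumptions, not code from the paper; NumPy's gamma is parameterized by shape and scale, matching the (α, β) convention used here, and points whose auxiliary variable falls inside the margin are simply marked with label 0.

```python
import numpy as np

def sample_bemkl(K, alpha=1.0, beta=1.0, nu=1.0, seed=0):
    """Ancestral sampling from the BEMKL generative model.

    K: array of shape (P, N, N) holding precomputed (symmetric) kernel matrices.
    All gamma hyper-parameters are shared here only for brevity.
    """
    rng = np.random.default_rng(seed)
    P, N, _ = K.shape
    lam = rng.gamma(alpha, beta, size=N)             # precisions of a
    a = rng.normal(0.0, 1.0 / np.sqrt(lam))          # sample-level weights
    G = np.stack([K[m] @ a for m in range(P)])       # means a^T k_{m,i}
    G += rng.normal(size=(P, N))                     # unit-variance noise
    gamma = rng.gamma(alpha, beta)
    b = rng.normal(0.0, 1.0 / np.sqrt(gamma))        # bias
    omega = rng.gamma(alpha, beta, size=P)
    e = rng.normal(0.0, 1.0 / np.sqrt(omega))        # kernel weights
    f = e @ G + b + rng.normal(size=N)               # auxiliary variables
    # delta(f_i y_i > nu): the label is the sign of f_i when |f_i| > nu;
    # values inside the margin receive no label (marked 0 here).
    y = np.where(f > nu, 1, np.where(f < -nu, -1, 0))
    return a, e, b, f, y

K = np.stack([np.eye(4) for _ in range(3)])          # toy kernels: P = 3, N = 4
print(sample_bemkl(K)[-1])
```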

4. Efficient Inference Using Variational Approximation

Exact inference for our probabilistic model is intractable and using a Gibbs sampling approach is computationally expensive (Gelfand & Smith, 1990). We instead formulate a deterministic variational approximation, which is more efficient in terms of computation time. Variational methods find the joint parameter distribution by maximizing a lower bound on the marginal likelihood over an ensemble of factored posteriors (Beal, 2003). We can write the factorable ensemble approximation of the required posterior as

\[
p(\Theta, \Xi \mid \{K_m\}_{m=1}^{P}, y) \approx q(\Theta, \Xi) = q(\lambda)\, q(a)\, q(G)\, q(\gamma)\, q(\omega)\, q(b, e)\, q(f)
\]

and define each factor in the ensemble just like its full conditional distribution:

\[
\begin{aligned}
q(\lambda) &= \prod_{i=1}^{N} \mathcal{G}(\lambda_i; \alpha(\lambda_i), \beta(\lambda_i)) \\
q(a) &= \mathcal{N}(a; \mu(a), \Sigma(a)) \\
q(G) &= \prod_{i=1}^{N} \mathcal{N}(g_i; \mu(g_i), \Sigma(g_i)) \\
q(\gamma) &= \mathcal{G}(\gamma; \alpha(\gamma), \beta(\gamma)) \\
q(\omega) &= \prod_{m=1}^{P} \mathcal{G}(\omega_m; \alpha(\omega_m), \beta(\omega_m)) \\
q(b, e) &= \mathcal{N}\!\left( \begin{bmatrix} b \\ e \end{bmatrix}; \mu(b, e), \Sigma(b, e) \right) \\
q(f) &= \prod_{i=1}^{N} \mathcal{TN}(f_i; \mu(f_i), \Sigma(f_i), \rho(f_i))
\end{aligned}
\]

where α(·), β(·), µ(·), and Σ(·) denote the shape parameter, the scale parameter, the mean vector, and the covariance matrix for their arguments, respectively. TN(·; µ, Σ, ρ(·)) denotes the truncated normal distribution with the mean vector µ, the covariance matrix Σ, and the truncation rule ρ(·) such that TN(·; µ, Σ, ρ(·)) ∝ N(·; µ, Σ) if ρ(·) is true and TN(·; µ, Σ, ρ(·)) = 0 otherwise.

We can bound the marginal likelihood using Jensen's inequality:

\[
\log p(y \mid \{K_m\}_{m=1}^{P}) \geq \mathbb{E}_{q(\Theta, \Xi)}[\log p(y, \Theta, \Xi \mid \{K_m\}_{m=1}^{P})] - \mathbb{E}_{q(\Theta, \Xi)}[\log q(\Theta, \Xi)]
\tag{2}
\]

and optimize this bound by maximizing with respect to each factor separately until convergence. The approximate posterior distribution of a specific factor τ can be found as

\[
q(\tau) \propto \exp\!\left( \mathbb{E}_{q(\{\Theta, \Xi\} \setminus \tau)}[\log p(y, \Theta, \Xi \mid \{K_m\}_{m=1}^{P})] \right).
\]

For our model, thanks to the conjugacy, the resulting approximate posterior distribution of each factor follows the same distribution as the corresponding factor.

4.1. Inference Details

The approximate posterior distributions of the precision priors can be found as

\[
\begin{aligned}
q(\lambda) &= \prod_{i=1}^{N} \mathcal{G}\!\left( \lambda_i;\; \alpha_\lambda + \frac{1}{2},\; \left( \frac{1}{\beta_\lambda} + \frac{\widetilde{a_i^2}}{2} \right)^{-1} \right) \\
q(\gamma) &= \mathcal{G}\!\left( \gamma;\; \alpha_\gamma + \frac{1}{2},\; \left( \frac{1}{\beta_\gamma} + \frac{\widetilde{b^2}}{2} \right)^{-1} \right) \\
q(\omega) &= \prod_{m=1}^{P} \mathcal{G}\!\left( \omega_m;\; \alpha_\omega + \frac{1}{2},\; \left( \frac{1}{\beta_\omega} + \frac{\widetilde{e_m^2}}{2} \right)^{-1} \right)
\end{aligned}
\]

where the tilde notation denotes the posterior expectations as usual, i.e., \widetilde{h(\tau)} = \mathbb{E}_{q(\tau)}[h(\tau)].

The approximate posterior distribution of the weight parameters is a multivariate normal distribution:

\[
q(a) = \mathcal{N}\!\left( a;\; \Sigma(a) \sum_{m=1}^{P} K_m (\widetilde{g^m})^\top,\; \left( \operatorname{diag}(\widetilde{\lambda}) + \sum_{m=1}^{P} K_m K_m^\top \right)^{-1} \right).
\tag{3}
\]

The approximate posterior distribution of the intermediate outputs can be found as a product of multivariate normal distributions:

\[
q(G) = \prod_{i=1}^{N} \mathcal{N}\!\left( g_i;\; \Sigma(g_i) \left( \begin{bmatrix} k_{1,i}^\top \widetilde{a} \\ \vdots \\ k_{P,i}^\top \widetilde{a} \end{bmatrix} + \widetilde{f_i}\, \widetilde{e} - \widetilde{b}\, \widetilde{e} \right),\; \left( \mathbf{I} + \widetilde{e e^\top} \right)^{-1} \right).
\]

The approximate posterior distribution of the bias and the kernel weights can be formulated as a multivariate normal distribution:

\[
q(b, e) = \mathcal{N}\!\left( \begin{bmatrix} b \\ e \end{bmatrix};\; \Sigma(b, e) \begin{bmatrix} \mathbf{1}^\top \widetilde{f} \\ \widetilde{G} \widetilde{f} \end{bmatrix},\; \begin{bmatrix} \widetilde{\gamma} + N & \mathbf{1}^\top \widetilde{G}^\top \\ \widetilde{G} \mathbf{1} & \operatorname{diag}(\widetilde{\omega}) + \widetilde{G G^\top} \end{bmatrix}^{-1} \right)
\tag{4}
\]

where we allow kernel weights to take negative values. If the kernel weights are restricted to be nonnegative, the approximation becomes a truncated multivariate normal distribution restricted to the nonnegative orthant except for the first dimension, which is used for the bias.
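To show how the updates in (3) and (4) translate into matrix computations, the snippet below implements the q(a) and q(b, e) updates with NumPy. It is our own sketch, not the paper's implementation: the arguments ending in _tilde stand for the posterior expectations supplied by the other factors (in particular, GGT_tilde must be the posterior expectation of GG⊤, not just the product of the means), and the remaining updates are omitted.

```python
import numpy as np

def update_q_a(K, lambda_tilde, G_tilde):
    """Eq. (3): posterior mean and covariance of the weight vector a.

    K: (P, N, N) kernel matrices, G_tilde: (P, N) expected intermediate outputs.
    """
    KKT = sum(K[m] @ K[m].T for m in range(K.shape[0]))     # can be cached once
    Sigma_a = np.linalg.inv(np.diag(lambda_tilde) + KKT)     # N x N inverse
    mu_a = Sigma_a @ sum(K[m] @ G_tilde[m] for m in range(K.shape[0]))
    return mu_a, Sigma_a

def update_q_be(gamma_tilde, omega_tilde, G_tilde, GGT_tilde, f_tilde):
    """Eq. (4): posterior mean and covariance of (b, e)."""
    P, N = G_tilde.shape
    precision = np.zeros((P + 1, P + 1))
    precision[0, 0] = gamma_tilde + N
    precision[0, 1:] = G_tilde.sum(axis=1)                   # 1^T G~^T
    precision[1:, 0] = G_tilde.sum(axis=1)                   # G~ 1
    precision[1:, 1:] = np.diag(omega_tilde) + GGT_tilde     # expectation of G G^T
    Sigma_be = np.linalg.inv(precision)                      # (P+1) x (P+1) inverse
    mu_be = Sigma_be @ np.concatenate(([f_tilde.sum()], G_tilde @ f_tilde))
    return mu_be, Sigma_be

# toy shapes: P = 2 kernels, N = 3 points
rng = np.random.default_rng(0)
K = np.stack([np.eye(3), np.ones((3, 3))])
mu_a, Sigma_a = update_q_a(K, np.ones(3), rng.normal(size=(2, 3)))
```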
The approximate posterior distribution of the auxiliary variables is a product of truncated normal distributions given as

\[
q(f) = \prod_{i=1}^{N} \mathcal{TN}(f_i;\; \widetilde{e}^\top \widetilde{g_i} + \widetilde{b},\; 1,\; f_i y_i > \nu)
\]

where we need to find their posterior expectations to update the approximate posterior distributions of the intermediate outputs, the bias, and the kernel weights. Fortunately, the truncated normal distribution has a closed-form formula for its expectation.

The sum ∑_{m=1}^{P} Km Km⊤ in (3) should be cached before starting inference to reduce the computational complexity. (3) requires inverting an N × N matrix for the covariance calculation, whereas (4) requires inverting a (P + 1) × (P + 1) matrix. One of these two updates dominates the running time depending on whether N > P.
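The closed-form expectation mentioned above follows from standard truncated-normal identities. The helper below is our own sketch for the one-dimensional truncation rule f_i y_i > ν with unit variance, using SciPy's normal pdf and cdf; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def update_q_f(mu, y, nu=1.0):
    """Posterior mean and normalizer of q(f_i) = TN(f_i; mu_i, 1, f_i y_i > nu).

    mu: prior means e~^T g~_i + b~, y: labels in {-1, +1}.
    Standard truncated-normal mean for unit variance:
    E[f_i] = mu_i + y_i * phi(z_i) / Phi(z_i) with z_i = y_i * mu_i - nu.
    """
    z = y * mu - nu
    Z = norm.cdf(z)                        # normalization coefficients
    f_tilde = mu + y * norm.pdf(z) / Z     # posterior expectations
    return f_tilde, Z

f_tilde, Z = update_q_f(np.array([0.3, -1.2]), np.array([1, -1]))
```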

4.2. Convergence

The inference mechanism sequentially updates the approximate posterior distributions of the model parameters and the latent variables until convergence, which can be checked by monitoring the lower bound in (2). The first term of the lower bound corresponds to the sum of exponential-form expectations of the distributions in the joint likelihood. The second term is the sum of negative entropies of the approximate posteriors in the ensemble. The only nonstandard distribution in these terms is the truncated normal distribution used for the auxiliary variables; nevertheless, the truncated normal distribution has a closed-form formula also for its entropy. The exact form of the variational lower bound can be found in the supplementary material available at http://users.ics.aalto.fi/gonen/bemkl/.

4.3. Prediction

We can replace p(a | {Km}_{m=1}^{P}, y) with its approximate posterior distribution q(a) and obtain the predictive distribution of the intermediate outputs g⋆ for a new data point as

\[
p(g_\star \mid \{k_{m,\star}, K_m\}_{m=1}^{P}, y) = \prod_{m=1}^{P} \mathcal{N}\!\left( g_\star^m;\; \mu(a)^\top k_{m,\star},\; 1 + k_{m,\star}^\top \Sigma(a)\, k_{m,\star} \right).
\]

The predictive distribution of the auxiliary variable f⋆ can also be found by replacing p(b, e | {Km}_{m=1}^{P}, y) with its approximate posterior distribution q(b, e):

\[
p(f_\star \mid g_\star, \{K_m\}_{m=1}^{P}, y) = \mathcal{N}\!\left( f_\star;\; \mu(b, e)^\top \begin{bmatrix} 1 \\ g_\star \end{bmatrix},\; 1 + \begin{bmatrix} 1 \\ g_\star \end{bmatrix}^\top \Sigma(b, e) \begin{bmatrix} 1 \\ g_\star \end{bmatrix} \right)
\]

and the predictive distribution of the class label y⋆ can be formulated using the auxiliary variable distribution:

\[
p(y_\star = +1 \mid \{k_{m,\star}, K_m\}_{m=1}^{P}, y) = Z_\star^{-1}\, \Phi\!\left( \frac{\mu(f_\star) - \nu}{\Sigma(f_\star)} \right)
\]

where Z⋆ is the normalization coefficient calculated for the test data point and Φ(·) is the standardized normal cumulative distribution function.
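A minimal sketch of this prediction pipeline is given below. It is our own illustration: mu_a, Sigma_a, mu_be, and Sigma_be stand for the posterior quantities obtained during training, the predictive means of g⋆ are plugged in, the class probability is normalized over the two labels (one plausible reading of the coefficient Z⋆), and dividing by the predictive standard deviation is our choice of parameterization.

```python
import numpy as np
from scipy.stats import norm

def predict_proba(k_star, mu_a, Sigma_a, mu_be, Sigma_be, nu=1.0):
    """P(y_star = +1) for one test point from the predictive distributions.

    k_star: (P, N) kernel values between the test point and training points.
    """
    g_star = k_star @ mu_a                           # predictive means of g_star
    x = np.concatenate(([1.0], g_star))              # [1; g_star]
    mu_f = mu_be @ x                                 # mean of f_star
    var_f = 1.0 + x @ Sigma_be @ x                   # variance of f_star
    pos = norm.cdf((mu_f - nu) / np.sqrt(var_f))     # mass above the +nu margin
    neg = norm.cdf((-mu_f - nu) / np.sqrt(var_f))    # mass below the -nu margin
    return pos / (pos + neg)                         # normalized over the two labels
```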
5. Extensions

The proposed MKL method can be extended in several directions. We shortly explain two variants for multiclass learning and semi-supervised learning.

5.1. Multiclass Learning

In multiclass learning, we consider classification problems with more than two classes. The easiest way is to train a distinct classifier for each class that separates this particular class from the remaining ones (i.e., one-versus-all classification). In this setup, for each classifier, we learn a different combined kernel function, which corresponds to using a different similarity measure. Instead, we can have different classifiers but use the same set of kernel weights for each of them (Rakotomamonjy et al., 2008); a schematic wrapper is sketched at the end of this section. The inference details of our model with such an approach can be found in the supplementary material. Another possibility is to use the multinomial probit in our model for multiclass classification as done by Damoulas & Girolami (2008).

5.2. Semi-Supervised Learning

In the kernel-based discriminative learning framework, semi-supervised learning can be formulated as an integer programming problem, known as transductive SVMs (Vapnik, 1998). Even though there are some approximation methods, the computational complexity of such models is significantly higher than that of fully supervised models. Considering this complexity issue, it may not be feasible to use discriminative MKL algorithms for semi-supervised learning. However, our proposed model can be modified to make use of unlabeled data with a slight increase in computational complexity using the low-density assumption (Lawrence & Jordan, 2005).
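To illustrate the one-versus-all decomposition of Section 5.1, here is a schematic wrapper; bemkl_train and bemkl_predict are hypothetical stand-ins for a binary BEMKL implementation, and the shared-kernel-weights variant described above would instead tie a single set of kernel weights across all classes inside the inference itself.

```python
import numpy as np

def one_versus_all_train(K, y, classes, bemkl_train):
    """Train one binary model per class (bemkl_train is a hypothetical helper)."""
    return {c: bemkl_train(K, np.where(y == c, 1, -1)) for c in classes}

def one_versus_all_predict(K_star, models, bemkl_predict):
    """Assign each test point to the class with the largest probability."""
    classes = list(models)
    scores = np.column_stack([bemkl_predict(K_star, models[c]) for c in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]
```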
6. Experiments

We first test our new algorithm BEMKL on benchmark data sets to show its computational efficiency. We then illustrate its generalization performance by comparing it with previously reported MKL results on one bioinformatics and three image recognition data sets. We implement the proposed variational approximation for BEMKL in Matlab and our implementation is available at http://users.ics.aalto.fi/gonen/bemkl/.

6.1. Experiments on Benchmark Data Sets

Our first set of experiments is designed to evaluate the running times of BEMKL. The experiments are performed on eight data sets from the UCI repository: breast, bupaliver, heart, ionosphere, pima, sonar, wdbc, and wpbc. For each data set, we take 20 replications, where we randomly select 70 per cent of the data set as the training set and use the remaining part as the test set. The training set is normalized to have zero mean and unit standard deviation, and the test set is then normalized using the mean and the standard deviation of the training set.

We construct Gaussian kernels with 10 different widths ({2^{-3}, 2^{-2}, ..., 2^{6}}) and polynomial kernels with three different degrees ({1, 2, 3}) on all features and on each single feature, following the experimental settings of Rakotomamonjy et al. (2008), Xu et al. (2009; 2010), and Vishwanathan et al. (2010). All kernel matrices are normalized to have unit diagonal entries (i.e., spherical normalization) and precomputed before running the inference algorithm (a code sketch of this construction appears at the end of this subsection). The experiments are performed on a PC with a 3.0GHz CPU and 4GB memory.

We run BEMKL for two different sparsity levels on the kernel weights: sparse and non-sparse. The hyper-parameter values for these scenarios are selected as (α_λ, β_λ, α_γ, β_γ, α_ω, β_ω) = (1, 1, 1, 1, 10^{-10}, 10^{+10}) and (α_λ, β_λ, α_γ, β_γ, α_ω, β_ω) = (1, 1, 1, 1, 1, 1), respectively. We take 200 iterations for the variational inference scheme.

Table 1. Experiments on benchmark data sets. The figures are averages and standard deviations over 20 replications. Here, N and P denote the numbers of training instances and input kernels, respectively.

Data set (N, P)         Measure                 sparse          non-sparse
breast (478, 130)       Training Time (sec)     18.86±0.24      18.84±0.43
                        Test Accuracy (%)       96.80±1.02      96.98±0.86
                        Selected Kernel (#)     34.35±2.46      98.95±1.32
bupaliver (241, 91)     Training Time (sec)      3.83±0.03       3.81±0.04
                        Test Accuracy (%)       68.41±4.28      70.72±3.83
                        Selected Kernel (#)     18.40±2.62      70.70±1.13
heart (189, 377)        Training Time (sec)     13.69±0.08      13.58±0.09
                        Test Accuracy (%)       80.80±4.01      81.36±4.70
                        Selected Kernel (#)     41.85±4.26     181.40±8.93
ionosphere (245, 442)   Training Time (sec)     22.72±0.05      22.76±0.12
                        Test Accuracy (%)       92.03±1.72      92.03±1.77
                        Selected Kernel (#)     41.90±4.42     219.05±7.77
pima (537, 117)         Training Time (sec)     21.15±0.23      20.94±0.22
                        Test Accuracy (%)       75.02±2.28      74.96±2.08
                        Selected Kernel (#)     23.20±2.02      79.55±2.93
sonar (144, 793)        Training Time (sec)     51.34±0.19      51.51±0.10
                        Test Accuracy (%)       76.88±3.95      82.81±4.12
                        Selected Kernel (#)     15.30±3.40     372.80±7.78
wdbc (398, 403)         Training Time (sec)     41.31±0.21      41.44±0.16
                        Test Accuracy (%)       95.70±2.32      95.76±1.65
                        Selected Kernel (#)     35.65±2.28     215.50±4.81
wpbc (135, 429)         Training Time (sec)     12.72±0.11      12.67±0.23
                        Test Accuracy (%)       75.25±1.01      73.64±2.66
                        Selected Kernel (#)     28.25±2.75     220.10±9.83

Table 1 lists the results obtained by BEMKL with the two scenarios on the benchmark data sets in terms of three different measures: the training time in seconds, the test accuracy, and the number of selected kernels. We see that our deterministic variational approximation for BEMKL takes less than a minute with large numbers of kernels, ranging from 91 to 793. The classification accuracy difference between sparse and non-sparse kernel combination is quite obvious for some data sets such as sonar, in accordance with the previous studies. Using sparsity-inducing priors on the kernel weights clearly simulates the ℓ1-norm by eliminating most of the input kernels, whereas using uninformative priors picks many more kernels by simulating the ℓ2-norm.
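The kernel construction and spherical normalization described in Section 6.1 can be sketched as follows. This is our own illustration of the stated protocol: the exact parameterizations of the Gaussian width and of the polynomial kernel are assumptions, while the width and degree grids are the ones listed in the text.

```python
import numpy as np

def gaussian_kernel(X, width):
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * width ** 2))      # width parameterization assumed

def polynomial_kernel(X, degree):
    return (X @ X.T + 1.0) ** degree             # kernel form assumed

def spherical_normalization(K):
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)                    # unit diagonal entries

def build_kernels(X):
    """Gaussian kernels with widths 2^-3..2^6 and polynomial kernels of degree
    1-3, on all features and on each single feature, all normalized."""
    kernels = []
    feature_sets = [np.arange(X.shape[1])] + [np.array([j]) for j in range(X.shape[1])]
    for cols in feature_sets:
        for width in 2.0 ** np.arange(-3, 7):
            kernels.append(spherical_normalization(gaussian_kernel(X[:, cols], width)))
        for degree in (1, 2, 3):
            kernels.append(spherical_normalization(polynomial_kernel(X[:, cols], degree)))
    return np.stack(kernels)
```

With D input features this yields 13(D + 1) kernels, which matches the kernel counts reported in Table 1 (e.g., 130 kernels for breast with D = 9).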

6.2. Comparison on MKL Data Sets

We use four data sets that have previously been used to compare MKL algorithms and have kernel matrices available for direct evaluation. Note that the results used for comparison may not be the best results reported on these data sets, but we use the exact same kernel matrices that produce these results for our algorithm to have comparable performance measures. These data sets have small numbers of kernels available and we do not force any sparsity on the kernel weights, using an uninformative gamma prior in accordance with the previous studies. All data sets consider multiclass classification problems and we report both one-versus-all and multiclass results for BEMKL.

6.2.1. Protein Fold Recognition Data Set

Protein data set considers 27 different fold types of 694 proteins (311 for training and 383 for testing) and it is available at http://mkl.ucsd.edu/dataset/protein-fold-prediction/. There are 12 distinct feature representations summarizing protein characteristics. We construct 12 kernels on these representations and take 20 replications using the experimental procedure of Damoulas & Girolami (2008). Table 2 gives the classification results on Protein data set. We see that BEMKL is significantly better than the probabilistic MKL method of Damoulas & Girolami (2008).

Table 2. Performance comparison on Protein data set.

Method                          Test Accuracy
Damoulas & Girolami (2008)      68.1±1.2
BEMKL (one-versus-all)          71.5±0.1
BEMKL (multiclass)              71.2±0.2

6.2.2. Oxford Flowers17 Data Set

Flowers17 data set contains flower images from 17 different types with 80 images per class and it is available at http://www.robots.ox.ac.uk/~vgg/data/flowers/17/. It also provides three predefined splits with 60 images for training and 20 images for testing from each class. There are seven precomputed distance matrices over different feature representations. These matrices are converted into kernels as k(xi, xj) = exp(−d(xi, xj)/s), where s is the mean distance between training point pairs. The classification results on Flowers17 data set are shown in Table 3. BEMKL achieves higher average test accuracy with smaller standard deviation across splits than the boosting-type MKL algorithm of Gehler & Nowozin (2009). We also see that using the same set of kernel weights for each class, as in our multiclass formulation, is better than learning separate sets of kernel weights, as also observed by Gehler & Nowozin (2009).

Table 3. Performance comparison on Flowers17 data set.

Method                          Test Accuracy
Gehler & Nowozin (2009)         85.5±3.0
BEMKL (one-versus-all)          85.6±1.2
BEMKL (multiclass)              85.9±1.2
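The distance-to-kernel conversion used above (and reused for Flowers102) is simple enough to state as code. This is our own sketch of k(xi, xj) = exp(−d(xi, xj)/s) with s the mean distance between training point pairs; whether the diagonal is excluded when computing s is an assumption.

```python
import numpy as np

def distances_to_kernel(D, train_idx):
    """Convert a precomputed distance matrix D into a kernel matrix.

    s is the mean distance between training point pairs (off-diagonal entries;
    including the diagonal is an implementation detail left unspecified here).
    """
    D_train = D[np.ix_(train_idx, train_idx)]
    mask = ~np.eye(len(train_idx), dtype=bool)
    s = D_train[mask].mean()
    return np.exp(-D / s)
```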

6.2.3. Oxford Flowers102 Data Set

Flowers102 data set contains flower images from 102 different types with more than 40 images per class and it is available at http://www.robots.ox.ac.uk/~vgg/data/flowers/102/. There is a predefined split consisting of 2040 training and 6149 testing images. There are four precomputed distance matrices over different feature representations. These matrices are converted into kernels using the same procedure as on Flowers17 data set. Table 4 shows the classification results on Flowers102 data set. We report averages of area under the ROC curve (AUC) and equal error rate (EER) values calculated for each class in addition to multiclass accuracy. We see that BEMKL outperforms the GP-based method of Titsias & Lázaro-Gredilla (2011) in all metrics on this challenging task.

Table 4. Performance comparison on Flowers102 data set.

Method                                AUC      EER      Accuracy
Titsias & Lázaro-Gredilla (2011)      0.952    0.107    40.0
BEMKL (one-versus-all)                0.969    0.068    67.0
BEMKL (multiclass)                    0.969    0.069    68.9

6.2.4. Caltech101 Data Set

Caltech101 data set consists of object images from 102 different categories and it is available at http://www.robots.ox.ac.uk/~vgg/software/MKL/. There are three predefined splits with 15 images for training and 15 images for testing from each class, along with 10 precomputed kernel matrices. The classification results on Caltech101 data set are given in Table 5. BEMKL achieves higher average test accuracy with smaller standard deviation across splits than the discriminative MKL algorithm of Varma & Babu (2009). Similar to the results on Flowers17 and Flowers102, we see that using the same set of kernel weights for each class instead of separate sets is better in terms of generalization performance. This approach allows the classifiers trained for each class to share information with the others, similar to multitask learning.

Table 5. Performance comparison on Caltech101 data set.

Method                          Test Accuracy
Varma & Babu (2009)             71.1±0.6
BEMKL (one-versus-all)          72.7±0.1
BEMKL (multiclass)              73.1±0.1

7. Conclusions

In this paper, we introduce a Bayesian multiple kernel learning framework in order to have computationally feasible algorithms by formulating the combination in a novel way. This formulation allows us to develop fully conjugate probabilistic models and to derive very efficient deterministic variational approximations for inference. We give the inference details for binary classification only due to the space limitation and explain briefly how the method can be extended to multiclass learning and semi-supervised learning. Another interesting direction is to formulate a multitask learning method by enforcing similarity between the kernel weights of different tasks.

Experimental results on benchmark binary classification data sets show that the proposed method can combine hundreds of kernels in less than a minute. We also report very good results on one bioinformatics and three image recognition data sets, which contain multiclass classification problems, compared to previously reported results.

Acknowledgments. This work was financially supported by the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, grant no 251170).

References

Albert, J. H. and Chib, S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679, 1993.

Bach, F. R., Lanckriet, G. R. G., and Jordan, M. I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, 2004.

Beal, M. J. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, The Gatsby Computational Neuroscience Unit, University College London, 2003.

Cortes, C., Mohri, M., and Rostamizadeh, A. L2 regularization for learning kernels. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

Damoulas, T. and Girolami, M. A. Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection. Bioinformatics, 24(10):1264–1270, 2008.

Gehler, P. and Nowozin, S. On feature combination for multiclass object classification. In Proceedings of the 12th International Conference on Computer Vision, 2009.

Gelfand, A. E. and Smith, A. F. M. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409, 1990.

Girolami, M. and Rogers, S. Hierarchic Bayesian models for kernel learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

Girolami, M. and Zhong, M. Data integration for classification problems employing Gaussian process priors. In Advances in Neural Information Processing Systems 19, 2007.

Gönen, M. and Alpaydın, E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.

Kloft, M., Brefeld, U., Sonnenburg, S., and Zien, A. ℓp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, 2011.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

Lawrence, N. D. and Jordan, M. I. Semi-supervised learning via Gaussian processes. In Advances in Neural Information Processing Systems 17, 2005.

Rakotomamonjy, A., Bach, F. R., Canu, S., and Grandvalet, Y. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

Sonnenburg, S., Rätsch, G., Schäfer, C., and Schölkopf, B. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

Titsias, M. K. and Lázaro-Gredilla, M. Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in Neural Information Processing Systems 24, 2011.

Vapnik, V. N. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

Varma, M. and Babu, B. R. More generality in efficient multiple kernel learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.

Vishwanathan, S. V. N., Sun, Z., Theera-Ampornpunt, N., and Varma, M. Multiple kernel learning and the SMO algorithm. In Advances in Neural Information Processing Systems 23, 2010.

Xu, Z., Jin, R., King, I., and Lyu, M. R. An extended level method for efficient multiple kernel learning. In Advances in Neural Information Processing Systems 21, 2009.

Xu, Z., Jin, R., Yang, H., King, I., and Lyu, M. R. Simple and efficient multiple kernel learning by group Lasso. In Proceedings of the 27th International Conference on Machine Learning, 2010.

Zhang, Z., Dai, G., and Jordan, M. I. Bayesian generalized kernel mixed models. Journal of Machine Learning Research, 12:111–139, 2011.

Zien, A. and Ong, C. S. Multiclass multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.