Pattern Recognition 46 (2013) 795–807

Contents lists available at SciVerse ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/pr

Localized algorithms for multiple kernel learning

Mehmet Gonen¨ n, Ethem Alpaydın

Department of Computer Engineering, Bogazic˘ -i University, TR-34342 Bebek, Istanbul,_ Turkey article info abstract

Article history: Instead of selecting a single kernel, multiple kernel learning (MKL) uses a weighted sum of kernels Received 20 September 2011 where the weight of each kernel is optimized during training. Such methods assign the same weight to Received in revised form a kernel over the whole input space, and we discuss localized multiple kernel learning (LMKL) that is 28 March 2012 composed of a kernel-based learning algorithm and a parametric gating model to assign local weights Accepted 2 September 2012 to kernel functions. These two components are trained in a coupled manner using a two-step Available online 11 September 2012 alternating optimization algorithm. Empirical results on benchmark classification and regression data Keywords: sets validate the applicability of our approach. We see that LMKL achieves higher accuracy compared Multiple kernel learning with canonical MKL on classification problems with different feature representations. LMKL can also Support vector machines identify the relevant parts of images using the gating model as a saliency detector in image recognition Support vector regression problems. In regression tasks, LMKL improves the performance significantly or reduces the model Classification Regression complexity by storing significantly fewer support vectors. Selective attention & 2012 Elsevier Ltd. All rights reserved.

1. Introduction us to obtain the following dual formulation:

XN 1 XN XN Support vector machine (SVM) is a discriminative classifier max: a a a y y kðx ,x Þ i 2 i i i j i j based on the theory of structural risk minimization [33]. Given i ¼ 1 i ¼ 1 j ¼ 1 a sample of independent and identically distributed training w:r:t: aA½0,CN N A D A instances fðxi,yiÞgi ¼ 1, where xi R and yi f1, þ1g is its class XN label, SVM finds the linear discriminant with the maximum s:t: aiyi ¼ 0 margin in the feature space induced by the mapping function i ¼ 1 ðÞ F . The discriminant function is where a is the vector of dual variables corresponding to each f ðxÞ¼/w,FðxÞSþb separation constraint and the obtained kernel matrix of kðxi,xjÞ¼ / S PFðxiÞ,FðxjÞ is positive semidefinite. Solving this, we get w ¼ whose parameters can be learned by solving the following quad- N aiy FðxiÞ and the discriminant function can be written as ratic optimization problem: i ¼ 1 i XN XN 1 f ðxÞ¼ a y kðx ,xÞþb: min: JwJ2 þC x i i i 2 2 i i ¼ 1 i ¼ 1 A S A N A w:r:t: w R , n R þ , b R There are several kernel functions successfully used in the / S s:t: yið w,FðxiÞ þbÞZ1xi 8i literature such as the linear kernel (kL), the polynomial kernel (kP), and the Gaussian kernel (k ) where w is the vector of weight coefficients, S is the dimension- G ality of the feature space obtained by FðÞ, C is a predefined kLðxi,xjÞ¼/xi,xjS positive trade-off parameter between model simplicity and clas- q kPðxi,xjÞ¼ð/xi,xjSþ1Þ qAN sification error, n is the vector of slack variables, and b is the bias k ðx ,x Þ¼expðJx x J2=s2Þ sA : term of the separating hyperplane. Instead of solving this opti- G i j i j 2 R þþ mization problem directly, the Lagrangian dual function enables There are also kernel functions proposed for particular applications, such as natural language processing [24] and bioinformatics [31].

n Selecting the kernel function kð,Þ and its parameters (e.g., q or s) Corresponding author. E-mail addresses: [email protected] (M. Gonen),¨ is an important issue in training. Generally, a cross-validation [email protected] (E. Alpaydın). procedure is used to choose the best performing kernel function

0031-3203/$ - see front matter & 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2012.09.002 796 M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807 among a set of kernel functions on a separate validation set combined kernel by adding a new kernel as training continues [5,9]. different from the training set. In recent years, multiple kernel In a trained combiner parameterized by H, if we assume H to learning (MKL) methods are proposed, where we use multiple contain random variables with a prior, we can use a Bayesian kernels instead of selecting one specific kernel function and its approach. For the case of a weighted sum, we can, for example, corresponding parameters have a prior on the kernel weights [11,12,28]. A recent survey of multiple kernel learning algorithms is given in [18]. k ðx ,x Þ¼f ðfk ðxm,xmÞgP Þð1Þ Z i j Z m i j m ¼ 1 This paper is organized as follows: We formulate our proposed where the combination function f ZðÞ can be a linear or a non- nonlinear combination method localized MKL (LMKL) with detailed P linear function of the input kernels. Kernel functions, fkmð,Þgm ¼ 1, mathematical derivations in Section 2. We give our experimental take P feature representations (not necessarily different) of data results in Section 3 where we compare LMKL with MKL and single m P m A Dm kernel SVM. In Section 4, we discuss the key properties of our instances, where xi ¼fxi gm ¼ 1, xi R , and Dm is the dimen- sionality of the corresponding feature representation. proposed method together with related work in the literature. We The reasoning is similar to combining different classifiers: conclude in Section 5. Instead of choosing a single kernel function and putting all our eggs in the same basket, it is better to have a set and let an algorithm do the picking or combination. There can be two uses 2. Localized multiple kernel learning of MKL: (i) Different kernels correspond to different notions of similarity and instead of trying to find which works best, a Using a fixed unweighted or weighted sum assigns the same learning method does the picking for us, or may use a combina- weight to a kernel over the whole input space. Assigning different tion of them. Using a specific kernel may be a source of bias, and weights to a kernel in different regions of the input space may in allowing a learner to choose among a set of kernels, a better produce a better classifier. If the data has underlying local structure, solution can be found. (ii) Different kernels may be using inputs different similarity measures may be suited in different regions. We coming from different representations possibly from different propose to divide the input space into regions using a gating function sources or modalities. Since these are different representations, and assign combination weights to kernels in a data-dependent they have different measures of similarity corresponding to different way [13]; in the neural network literature, a similar architecture is kernels. In such a case, combining kernels is one possible way to previously proposed under the name ‘‘mixture of experts’’ [20,3]. The combine multiple information sources. discriminant function for binary classification is rewritten as

Since their original conception, there is significant work on the XP 9 / m S theory and application of multiple kernel learning. Fixed rules use f ðxÞ¼ Zmðx VÞ wm,Fmðx Þ þb ð2Þ the combination function in (1) as a fixed function of the kernels, m ¼ 1 without any training. Once we calculate the combined kernel, we 9 where Zmðx VÞ is a parametric gating model that assigns a weight to train a single kernel machine using this kernel. For example, we m Fmðx Þ as a function of x and V is the matrix of gating model can obtain a valid kernel by taking the summation or multi- parameters. Note that unlike in MKL, in LMKL, it is not obligatory to plication of two kernels as follows [10]: combine different feature spaces; we can also use multiple copies of 1 1 2 2 the same feature space (i.e., kernel) in different regions of the input kZðxi,xjÞ¼k1ðxi ,xj Þþk2ðxi ,xj Þ space and thereby obtain a more complex discriminant function. For k ðx ,x Þ¼k ðx1,x1Þk ðx2,x2Þ: Z i j 1 i j 2 i j example, as we will see shortly, we can combine multiple linear The summation rule is applied successfully in computational kernels to get a piecewise linear discriminant. biology [27] and optical digit recognition [25] to combine two or more kernels obtained from different representations. 2.1. Gating models Instead of using a fixed combination function, we can have a function parameterized by a set of parameters H and then we In order to assign kernel weights in a data-dependent way, we have a learning procedure to optimize H as well. The simplest use a gating model. Originally, we investigated the softmax gating case is to parameterize the sum rule as a weighted sum model [13] XP / GS m m expð vm,x þvm0Þ kZðxi,xj9H ¼ gÞ¼ Z kmðx ,x Þ Z ðx9VÞ¼P 8m ð3Þ m i j m P / GS m ¼ 1 h ¼ 1 expð vh,x þvh0Þ

A G A DG with Zm R. Different versions of this approach differ in the way where x R is the representation of the input instance in the they put restrictions on the kernel weights [22,4,29,19]. For feature space in which we learn the gating model and VARPðDG þ 1Þ P example, we can use arbitrary weights (i.e., linear combination), contains the gating model parameters fvm,vm0gm ¼ 1. The softmax nonnegative kernel weights (i.e., conic combination), or weights gating model uses kernels in a competitive manner and generally a on a simplex (i.e., convex combination). A linear combination single kernel is active for each input. may be restrictive and nonlinear combinations are also possible It is possible to use other gating models and below, we discuss [23,13,8]; our proposed approach is of this type and we will two new ones, namely sigmoid and Gaussian. The gating model discuss these in more detail later. defines the shape of the region of expertise of the kernels. The We can learn the kernel combination weights using a quality sigmoid function allows multiple kernels to be used in a coop- measure that gives performance estimates for the kernel matrices erative manner calculated on training data. This corresponds to a function that Z ðx9VÞ¼1=ð1þexpð/v ,xGSv ÞÞ 8m: ð4Þ assigns weights to kernel functions m m m0

m m P Instead of parameterizing the boundaries of the local regions g ¼ g ðfkmðx ,x Þg Þ: Z i j m ¼ 1 for kernels, we can also parameterize their centers and spreads The quality measure used for determining the kernel weights could using Gaussian gating be ‘‘kernel alignment’’ [21,22] or another similarity measure such as expðJxGl J2=s2 Þ the Kullback–Leibler divergence [36]. Another possibility inspired Z ðx9VÞ¼P m 2 m 8m ð5Þ m P ðJ G J2 2Þ from ensemble and boosting methods is to iteratively update the h ¼ 1 exp x lh 2=sh M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807 797

A PðDG þ 1Þ P XN where V R contains the means, flmgm ¼ 1, and the P s:t: aiyi ¼ 0 ð8Þ spreads, fsmgm ¼ 1; we do not experiment any further with this in this current work. i ¼ 1 If we combine the same feature representation with different where the locally combined kernel function is defined as 1 2 P kernels (i.e., x ¼ x ¼ x ¼ ... ¼ x ), we can simply use it also in XP the gating model (i.e., xG ¼ x) [13]. If we combine different feature 9 m m 9 kZðxi,xjÞ¼ Zmðxi VÞkmðxi ,xj ÞZmðxj VÞ: representations with the same kernel, the gating model repre- m ¼ 1 G m P sentation x can be one of the representations fx gm ¼ 1,a Note that if the input kernel matrices are positive semidefinite, the concatenation of a subset of them, or a completely different combined kernel matrix is also positive semidefinite by construc- representation. In some application areas such as bioinformatics tion. Locally combined kernel matrix is the summation of P where data instances may appear in a non-vectorial format such matrices obtained by pre- and post-multiplying each kernel matrix as sequences, trees, and graphs, where we can calculate kernel by the vector that contains gating model outputs for this kernel. matrices but cannot represent the data instances as x vectors Using the support vector coefficients obtained from (8) and the directly, we may use an empirical kernel map [31, Chapter 2], gating model parameters, we obtain the following discriminant which corresponds to using the kernel values between x and function: training points as the feature vector for x, and define xG in terms XN of the kernel values [15] f ðxÞ¼ aiyikZðxi,xÞþb: ð9Þ G > i ¼ 1 x ¼½kGðx1,xÞ kGðx2,xÞkGðxN,xÞ For a given V, the gradients of the objective function in (8) are where the gating kernel, kGð,Þ, can be one of the combined P equal to the gradients of the objective function in (6) due to kernels, fkmð,Þgm ¼ 1, a combination of them, or a completely different kernel used only for determining the gating boundaries. strong duality, which guarantees that, for a convex quadratic optimization, the dual problem has the same optimum value as its primal problem. These gradients are used to update the gating 2.2. Mathematical model model parameters at each step. Using the discriminant function in (2) and regularizing the discriminant coefficients of all the feature spaces together, LMKL 2.3. Training with alternating optimization obtains the following optimization problem: We can find the gradients of JðVÞ with respect to the para- 1 XP XN meters of all three gating models. The gradients of (8) with min: Jw J2 þC x 2 m 2 i respect to the parameters of the softmax gating model (3) are m ¼ 1 i ¼ 1

Sm N PðDG þ 1Þ XN XN XP w:r:t: wm AR , nAR , VAR , bAR @JðVÞ 1 þ ¼ U jZ ðx 9VÞk ðxh,xhÞZ ðx 9VÞ Z @v 2 i h i h i j h j s:t: yif ðxiÞ 1xi 8i ð6Þ m i ¼ 1 j ¼ 1 h ¼ 1 G h 9 G h 9 where nonconvexity is introduced to the model due to the non- ðxi ðdmZmðxi VÞÞþxj ðdmZmðxj VÞÞÞ linearity formed using the gating model outputs in the separation @JðVÞ 1 XN XN XP constraints. Instead of trying to solve (6) directly, we can use a ¼ U jZ ðx 9VÞk ðxh,xhÞZ ðx 9VÞ @v 2 i h i h i j h j two-step alternating optimization algorithm [13], also used for m0 i ¼ 1 j ¼ 1 h ¼ 1 h h choosing kernel parameters [6] and obtaining Zm parameters of 9 9 ðdmZmðxi VÞþdmZmðxj VÞÞ MKL [29]. This procedure consists of two basic steps: (i) solving j h the model with a fixed gating model, and, (ii) updating the gating where Ui ¼ aiajyiyj, and dm is 1 if m¼h and 0 otherwise. The same model parameters with the gradients calculated from the current gradients with respect to the parameters of the sigmoid gating solution. model (4) are Note that if we fix the gating model parameters, the optimiza- @JðVÞ 1 XN XN tion problem (6) becomes convex and we can find the correspond- ¼ U jZ ðx 9VÞk ðxm,xmÞZ ðx 9VÞ @v 2 i m i m i j m j ing dual optimization problem using duality. For a fixed V,we m i ¼ 1 j ¼ 1 G 9 G 9 obtain the Lagrangian dual of the primal problem (6) as follows: ðxi ð1Zmðxi VÞÞþxj ð1Zmðxj VÞÞÞ XP XN XN XN @JðVÞ 1 XN XN 1 J J2 j 9 m m 9 LDðVÞ¼ wm 2 þC xi bixi ai yif ðxiÞ1þxi ¼ Ui Zmðxi VÞkmðxi ,xj ÞZmðxj VÞ 2 @vm0 2 m ¼ 1 i ¼ 1 i ¼ 1 i ¼ 1 i ¼ 1 j ¼ 1 ð1Z ðx 9VÞþ1Z ðx 9VÞÞ and taking the derivatives of LDðVÞ with respect to the primal m i m j variables gives where the gating model parameters for a kernel function are @L ðVÞ XN updated independently. We can also find the gradients with respect D ¼ 0 ) w ¼ a y Z ðx 9VÞF ðxmÞ8m @w m i i m i m i to the means and the spreads of the Gaussian gating model (5) are m i ¼ 1 XN XN XP XN @JðVÞ j @LDðVÞ 9 h h 9 ¼ U Zhðxi VÞkhðx ,x ÞZhðxj VÞ ¼ 0 ) aiyi ¼ 0 @l i i j @b m i ¼ 1 j ¼ 1 h ¼ 1 i ¼ 1 @L ðVÞ ððxGl Þðdh Z ðx 9VÞÞþðxGl Þðdh Z ðx 9VÞÞÞ=s2 D ¼ 0 ) C ¼ a þb 8i: ð7Þ i m m m i j m m m j m @x i i i ð Þ XN XN XP @J V j 9 h h 9 ¼ Ui Zhðxi VÞkhðxi ,xj ÞZhðxj VÞ From LDðVÞ and (7), the dual formulation is obtained as @s m i ¼ 1 j ¼ 1 h ¼ 1 XN XN XN 1 ðJxGl J2ðdh Z ðx 9VÞÞþJxGl J2ðdh Z ðx 9VÞÞÞ=s3 : max: JðVÞ¼ a a a y y k ðx ,x Þ i m 2 m m i j m 2 m m j m i 2 i i i j Z i j i ¼ 1 i ¼ 1 j ¼ 1 The complete algorithm of our proposed LMKL is summarized in N w:r:t: aA½0,C Algorithm 1. Previously, we used to perform a predetermined number 798 M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807 of iterations [13]; now, we calculate a step size at each iteration using The same learning algorithm given for two-class classification a line search method and catch the convergence of the algorithm by problems can be applied to regression problems by simply replacing j observing the change in the objective function value of (8).This Ui in gradient-descent of the gating model (see Section 2.3)with þ þ allows converging to a better solution and hence a better learner. Our ðai ai Þðaj aj Þ. algorithm is guaranteed to converge in a finite number of iterations. At each iteration, we pick the step size using a line search method 2.4.2. Multiclass support vector machine and there is no chance of increasing the objective function value. 
In a multiclass classification problem, a data instance can After a finite number of iterations, our algorithm converges to one of belong to one of K classes and the class label is given as local optima due to nonconvexity of the primal problem in (6). yi Af1,2, ...,Kg. There are two basic approaches in the literature to solve multiclass problems. In the multimachine approach, the Algorithm 1. Localized Multiple Kernel Learning (LMKL). original multiclass problem is converted to a number of indepen- dent, uncoupled two-class problems. In the single-machine 1: Initialize Vð0Þ randomly approach, the constraints due to having multiple classes are 2: repeat coupled in a single formulation [33]. ðtÞ N ðtÞ We can easily apply LMKL to the multimachine approach by 3: Calculate KZ ¼fkZðxi,xjÞgi,j ¼ 1 using V solving (8) for each two-class problem separately. In such a case, 4: Solve kernel machine with KðtÞ Z we obtain different gating models parameters and hence, different @JðVÞ 5: Calculate descent direction kernel weighing strategies for each of the problems. Another @V possibility is to solve these uncoupled problems separately but learn 6: Determine step size, DðtÞ, using a line search method a common gating model; a similar approach is used for obtaining @JðVÞ 7: Vðt þ 1Þ ( VðtÞDðtÞ common kernel weights in MKL for multiclass problems [29]. @V For the single-machine approach, for class l, we write the 8: until convergence discriminant function as follows:

XP f lðxÞ¼ Z ðx9VÞ/wl ,F ðxmÞSþbl: 2.4. Extensions to other algorithms m m m m ¼ 1

We extend our proposed LMKL framework for two-class The modified primal optimization problem is classification [13] to other kernel-based algorithms, namely sup- 1 XP XK XN XK min: Jwl J2 þC xl port vector regression (SVR) [16], multiclass SVM (MCSVM), and 2 m 2 i one-class SVM (OCSVM). Note that any kernel machine that has m ¼ 1 l ¼ 1 i ¼ 1 l ¼ 1 w:r:t: wl ARSm , nl ARN , VARPðDG þ 1Þ, bl AR a hyperplane-basedP decision function can be localized by replac- m þ P m y l l ing /w,FðxÞS with Z ðx9VÞ/wm,Fmðx ÞS and deriving the i a m ¼ 1 m s:t: f ðxiÞf ðxiÞZ2xi 8ði,l yiÞ corresponding update rules. yi xi ¼ 0 8i: 2.4.1. Support vector regression We can obtain the dual formulation for a given V by following the We can also apply the localized kernel idea to E-tube SVR [16]. same derivation steps: The decision function is rewritten as XN XK XN XN l 1 yj max: JðVÞ¼2 a d A A kZðx ,x Þ XP i 2 yi i j i j 9 / m S i ¼ 1 l ¼ 1 i ¼ 1 j ¼ 1 f ðxÞ¼ Zmðx VÞ wm,Fmðx Þ þb m ¼ 1 1 XN XN XK þ alð2ayi alÞk ðx ,x Þ 2 i j j Z i j and the modified primal optimization problem is i ¼ 1 j ¼ 1 l ¼ 1 XP XN w:r:t: l A N 1 J J2 þ a R þ min: wm 2 þC ðxi þxi Þ 2 XN XN m ¼ 1 i ¼ 1 l l þ s:t: a d Ai ¼ 0 8l A Sm A N A N A PðDG þ 1Þ A i yi w:r:t: wm R , n R þ , n R þ , V R , b R i ¼ 1 i ¼ 1 þ Z l l s:t: Eþxi yif ðxiÞ8i ð1d ÞC Za Z0 8ði,lÞ yi i Eþxi Zf ðxiÞyi 8i P where A ¼ K l. The resulting discriminant functions that use þ i l ¼ 1 ai where fn ,n g are the vectors of slack variables and E is the width the locally combined kernel function are given as of the regression tube. For a given V, the corresponding dual XN formulation is l l l f ðxÞ¼ ðd A alÞk ðx ,xÞþb : yi i i Z i XN XN i ¼ 1 þ þ max: JðVÞ¼ y ða a ÞE ða þa Þ y P i i i i i U j should be replaced with ðd j A A K alð2ayi alÞÞ in learning i ¼ 1 i ¼ 1 i yi i j l ¼ 1 i j j the gating model parameters for multiclass classification problems. 1 XN XN ða þ aÞða þ aÞk ðx ,x Þ 2 i i j j Z i j i ¼ 1 j ¼ 1 2.4.3. One-class support vector machine w:r:t: a þ A½0,CN, a A½0,CN OCSVM is a discriminative method proposed for novelty detection XN problems [30]. The task is to learn the smoothest hyperplane that þ s:t: ðai ai Þ¼0 puts most of the training instances to one side of the hyperplane i ¼ 1 while allowing other instances remaining on the other side with a and the resulting decision function is cost. In the localized version, we rewrite the discriminant function as

XN XP þ 9 / m S f ðxÞ¼ ðai ai ÞkZðxi,xÞþb: f ðxÞ¼ Zmðx VÞ wm,Fmðx Þ þb, i ¼ 1 m ¼ 1 M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807 799

! and the modified primal optimization problem is þ1:0 0:80:0 p12 ¼ 0:25 m12 ¼ S12 ¼ P N þ1:0 0:02:0 1 X X min: Jw J2 þC x þb 2 m 2 i 1:0 0:80:0 m ¼ 1 i ¼ 1 p21 ¼ 0:25 m21 ¼ S21 ¼ A Sm A N A PðDG þ 1Þ A 2:2 0:04:0 w:r:t: wm R , n R þ , V R , b R Z þ3:0 0:80:0 s:t: f ðxiÞþxi 0 8i: p ¼ 0:25 m ¼ S ¼ 22 22 2:2 22 0:04:0 For a given V, we obtain the following dual optimization problem: where data instances from the first two components are labeled as 1 XN XN max: JðVÞ¼ a a k ðx ,x Þ positive and others are labeled as negative. 2 i j Z i j i ¼ 1 j ¼ 1 First, we train both MKL and LMKL with softmax gating to combine a linear kernel, k , and a second-degree polynomial kernel, w:r:t: aA½0,CN L kP (q¼2). Fig. 1(b) shows the classification boundaries calculated XN and the support vectors stored on one of the training folds by s:t: ai ¼ 1 i ¼ 1 MKL that assigns combination weights 0.32 and 0.68 to kL and kP, respectively. We see that using the kernel matrix obtained by and the resulting discriminant function is combining kL and kP with these weights, we do not achieve a good XN approximation to the optimal Bayes’ boundary. As we see in f ðxÞ¼ aikZðxi,xÞþb: Fig. 1(c), LMKL divides the input space into two regions and uses i ¼ 1 the polynomial kernel to separate one component from two others j quadratically in one region and the linear kernel for the other In the learning algorithm, Ui should be replaced with aiaj when calculating the gradients with respect to the gating model component in the other region. We see that we get a very good parameters. approximation of the optimal Bayes’ boundary. The softmax func- tion in the gating model achieves a smooth transition between the two kernels. The superiority of the localized approach is also 3. Experiments apparent in the smoothness of the fit that uses fewer support vectors: MKL achieves 90.9570.61 per cent average test accuracy In this section, we report empirical performance of LMKL by storing 38.2372.34 per cent of training instances as support for classification and regression problems on several data sets and vectors, whereas LMKL achieves 91.8370.24 per cent average test compare LMKL with SVM, SVR, and MKL (using the linear formulation accuracy by storing 25.1370.91 per cent support vectors. of [4]). We use our own implementations1 of SVM, SVR, MKL, and With LMKL, we can also combine multiple copies of the same LMKL written in MATLAB and the resulting optimization problems kernel, as shown in Fig. 1(d), which shows the classification and for all these methods are solved using the MOSEK optimization gating model boundaries of LMKL using three linear kernels and software [26]. approximates the optimal Bayes’ boundary in a piecewise linear Except otherwise stated, our experimental methodology is as manner. For this configuration, LMKL achieves 91.7870.55 per follows: A random one-third of the data set is reserved as the test cent average test accuracy by storing 23.8371.20 per cent support set and the remaining two-thirds is resampled using 5 2 cross- vectors. Instead of using complex kernels such as polynomial validation to generate ten training and validation sets, with kernels of high-degree or the Gaussian kernel, local combination stratification (i.e., preserving class ratios) for classification pro- of simple kernels (e.g., linear or low-degree polynomial kernels) can blems. The validation sets of all folds are used to optimize C by produce accurate classifiers and avoid overfitting. Fig. 
2 shows the trying values {0.01, 0.1, 1, 10, 100} and for regression problems, E, average test accuracies, support vector percentages, and training the width of the error tube. The best configuration (measured as times with one standard deviation for LMKL with different number the highest average classification accuracy or the lowest mean of linear kernels. We see that even if we provide more kernels than square error (MSE) for regression problems) on the validation needed, LMKL uses only as many support vectors as required and folds is used to train the final classifiers/regressors on the training does not overfit. LMKL obtains nearly the same average test accura- folds and their performance is measured over the test set. We cies and support vector percentages with three or more linear have 10 test set results, and we report their averages and kernels. We also see that the training time of LMKL is increasing standard deviations, as well as the percentage of instances stored linearly with increasing number of kernels. as support vectors and the total training time (in seconds) including the cross-validation. We use the 5 2 cv paired F test 3.1.2. Combining multiple feature representations of benchmark for comparison [2]. In the experiments, we normalize the kernel data sets matrices to unit diagonal before training. We compare SVM, MKL, and LMKL in terms of classification performance, model complexity (i.e., stored support vector percen- 3.1. Classification experiments tage), and training time. We train SVMs with linear kernels calculated on each feature representation separately. We also train 3.1.1. Illustrative classification problem an SVM with a linear kernel calculated on the concatenation of all In order to illustrate our proposed algorithm, we use the toy data feature representations, which is referred to as ALL. MKL and LMKL set GAUSS4 [13] consisting of 1200 data instances generated from four combine linear kernels calculated on each feature representation. Gaussian components (two for each class) with the following prior LMKL uses a single feature representation or the concatenation of probabilities, mean vectors and covariance matrices: all feature representations in the gating model. We use both ! softmax and sigmoid gating models in our experiments. 3:0 0:80:0 p ¼ 0:25 m ¼ S ¼ We perform experiments on the Multiple Features (MULTIFEAT) 11 11 þ1:0 11 0:02:0 digit recognition data set2 from the UCI Repository,

1 Available at http://www.cmpe.boun.edu.tr/gonen/lmkl. 2 Available at http://archive.ics.uci.edu/ml/datasets/MultipleþFeatures. 800 M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807

5 5

4 4

3 3

2 2

1 1

0 0

−1 −1

−2 −2

−3 −3

−4 −4

−5 −5 −5 −4 −3 −2 −1 0 1 2 3 4 5 −5 −4 −3 −2 −1 0 1 2 3 4 5

5 5

4 4

3 3

2 2

1 1

0 0

−1 −1

−2 −2

−3 −3

−4 −4

−5 −5 −5 −4 −3 −2 −1 0 1 2 3 4 5 −5 −4 −3 −2 −1 0 1 2 3 4 5

Fig. 1. MKL and LMKL solutions on the GAUSS4 data set. (a) The dashed ellipses show the Gaussians from which data are sampled and the solid line shows the optimal Bayes’ discriminant. (b)–(d) The solid lines show the discriminants learned. The circled data points represent the support vectors stored. For LMKL solutions, the dashed lines shows the gating boundaries, where the gating model outputs of neighboring kernels are equal. (a) GAUSS4 data set. (b) MKL with (kL-kP). (c) LMKL with (kL-kP). (d)

LMKL with (kL-kL-kL).

93 50

92 40

91 30

90 20 test accuracy support vector 89 10 5101520 5101520 P P 300

200

100 training time

0 5101520 P

Fig. 2. The average test accuracies, support vector percentages, and training times on the GAUSS4 data set obtained by LMKL with multiple copies of linear kernels and softmax gating. M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807 801 composed of six different data representations for 2000 handwritten significantly fewer support vectors than MKL and SVM (ALL), and numerals. The properties of these feature representations are sum- ties with SVM (FAC). For the MULTIFEAT data set, the average kernel marized in Table 1. A binary classification problem is generated from weights and the average number of active kernels (whose gating the MULTIFEAT data set to separate small (‘0’–‘4’) digits from large values are nonzero) calculated on the test set are given in Table 3. (‘5’–‘9’) digits. We use the concatenation of all feature representations We see that both LMKL with softmax gating and LMKL with sigmoid in the gating model for this data set. gating use fewer kernels than MKL in the decision function. MKL Table 2 lists the classification results on the MULTIFEAT data set uses all kernels with the same weight for all inputs; LMKL uses a obtained by SVM, MKL, and LMKL. We see that SVM (ALL)is different smaller subset for each input. By storing significantly fewer significantly more accurate than the best SVM with single feature support vectors and using fewer active kernels, LMKL is significantly representation, namely SVM (FAC), but with a significant increase faster than MKL in the testing phase. in the number of support vectors. MKL is as accurate as SVM (ALL) MKL and LMKL are iterative methods and need to solve SVM but stores significantly more support vectors. LMKL with softmax problems at each iteration. LMKL also needs to update the gating gating is as accurate as SVM (ALL) using significantly fewer parameters and that is why it requires significantly longer support vectors. LMKL with sigmoid gating is significantly more training times than MKL when the dimensionality of the gating accurate than MKL, SVM (ALL), and single kernel SVMs. It stores model representation is high (649 in this set of experiments)— LMKL needs to calculate the gradients of (8) with respect to the Table 1 parameters of the gating model and to perform a line search using Multiple feature representations in the MULTIFEAT data set. these gradients. Learning with sigmoid gating is faster than softmax gating because with the sigmoid during the gradient- Name Dimension Data source update only a single value is used and updating takes OðPÞ time, whereas with the softmax, all gating outputs are used and updating FAC 216 Profile correlations 2 FOU 76 Fourier coefficients of the shapes is OðP Þ. When learning time is critical, the time complexity of this KAR 64 Karhunen–Loeve coefficients step can be reduced by decreasing the dimensionality of the gating MOR 6 Morphological features model representation using an unsupervised dimensionality reduc- PIX 240 Pixel averages in 2 3 windows tion method. Note also that both the output calculations and the ZER 47 Zernike moments gradients in separate kernels can be efficiently parallelized when parallel hardware is available. Table 2 Instead of combining different feature representations, we can Classification results on the MULTIFEAT data set. combine multiple copies of the same feature representation with LMKL. 
We combine multiple copies of linear kernels on the single Method Test accuracy Support vector Training time (s) best FAC representation using the sigmoid gating model on the same representation (see Fig. 3). Even if we increase accuracy (not SVM (FAC) 94.9770.87 17.9370.91 5.4270.20 SVM (FOU) 90.5471.11 28.9071.69 5.4770.26 significantly) by increasing the number of copies of the kernels SVM (KAR) 88.1370.73 33.6271.31 5.5870.56 compared to SVM (FAC), we could not achieve the performance SVM (MOR) 69.6170.14 61.9070.49 6.5070.65 obtained by combining different representations with sigmoid gating. SVM (PIX) 89.4270.65 46.3571.64 5.7170.62 SVM (ZER) 89.1270.63 26.2771.67 5.2870.35

Table 3 7 7 7 SVM (ALL) 97.69 0.44 23.36 1.15 3.65 0.32 Average kernel weights and number of active kernels on the MULTIFEAT data set.

MKL 97.4070.37 32.5970.82 38.9073.83 Method FAC FOU KAR MOR PIX ZER

MKL 0.2466 0.2968 0.0886 0.1464 0.1494 0.0722 LMKL (softmax) 97.6970.44 15.0671.03 5568.5171590.39 LMKL (sigmoid) 98.5870.41 15.2770.92 1639.037278.17 LMKL (softmax) 0.1908 0.1333 0.1492 0.3335 0.1134 0.0797 LMKL (sigmoid) 0.5115 0.5192 0.5429 0.5566 0.5274 0.5132

LMKL (6 FAC and sigmoid) 97.0370.67 18.5272.47 1190.747562.49 The average numbers of active kernels are 6.00, 2.43, and 5.36, respectively.

98 40

97 30 96 20 95 test accuracy support vector 94 10 5 101520 5101520 P P 4000

3000

2000

training time 1000

0 5101520 P

Fig. 3. The average test accuracies, support vector percentages, and training times on the MULTIFEAT data set obtained by LMKL with multiple copies of linear kernels and sigmoid gating on the FAC representation. 802 M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807

Table 4 Table 6 Multiple feature representations in the ADVERT data set. Average kernel weights and number of active kernels on the ADVERT data set.

Name Dimension Data source Method URL ORIGURL ANCURL ALT CAPTION

URL 457 Phrases occurring in the URL MKL 0.3073 0.1600 0.3497 0.1828 0.0003 ORIGURL 495 Phrases occurring in the URL of the image ANCURL 472 Phrases occurring in the anchor text LMKL (softmax) 0.3316 0.0160 0.6292 0.0172 0.0060 ALT 111 Phrases occurring in the alternative text LMKL (sigmoid) 0.9918 0.9820 0.9900 0.9913 0.4027 CAPTION 19 Phrases occurring in the caption terms The average numbers of active kernels are 4.10, 4.04, and 4.96, respectively.

Table 5 of or more kernels compared to MKL on this data set. (On one of Classification results on the ADVERT data set. the ten folds, MKL chooses five and on the remaining nine folds, it chooses four kernels, leading to an average of 4.1.) Method Test Support Training When we combine multiple copies of linear kernels on the accuracy vector time (s) ANCURL representation with LMKL using the sigmoid gating model SVM (URL) 94.6770.24 83.3271.89 26.4271.27 on the same representation (see Fig. 4), we see that LMKL stores SVM (ORIGURL) 92.0470.26 96.1670.51 23.3470.96 much fewer support vectors compared to the single kernel SVM 7 7 7 SVM (ANCURL) 95.45 0.31 64.90 5.41 26.65 1.42 (ANCURL) without sacrificing from accuracy. But, as before on the SVM (ALT) 89.6470.38 87.7371.17 22.5270.77 MULTIFEAT data set, we could not achieve the classification accu- SVM (CAPTION) 86.6070.09 96.6570.42 20.0270.75 racy obtained by combining different representations with sig- moid gating. For example, LMKL with sigmoid gating and kernels SVM (ALL) 96.4370.24 41.9971.76 31.9072.59 over five different feature representations is significantly better than LMKL with sigmoid gating and five copies of the kernel over MKL 96.3270.50 35.8274.35 101.4274.89 the ANCURL representation in terms of classification accuracy but the latter stores significantly fewer support vectors (see Table 5). 7 7 7 LMKL (softmax) 95.78 0.46 41.72 11.59 914.42 162.14 We again see that the training time of LMKL is increasing linearly LMKL (sigmoid) 96.7270.46 34.4071.51 819.617111.22 with increasing number of kernels.

LMKL (5 ANCURL 95.6670.29 10.8771.07 1287.677405.55 3.1.3. Combining multiple input patches for image recognition and sigmoid) problems For image recognition problems, only some parts of the images contain meaningful information and it is not necessary to exam- For example, LMKL with sigmoid gating and kernels over six ine the whole image in detail. Instead of defining kernels over the different feature representations is better than LMKL with sigmoid whole input image, we can divide the image into non-overlapping gating and six copies of the kernel over the FAC representation in patches and use separate kernels in these patches. The kernels terms of both classification accuracy (though not significantly) and calculated on the parts with relevant information take nonzero the number of support vectors stored (significantly) (see Table 2). weights and the kernels over the non-relevant patches are We also see that the training time of LMKL is increasing (though ignored. We use a low-resolution (simpler) version of the image not monotonically) with increasing number of kernels. as input to the gating model, which selects a subset of the high- We also perform experiments on the Internet Advertisements 3 resolution localized kernels. In such a case, it is not a good idea to (ADVERT) data set from the UCI Machine Learning Repository, use softmax gating in LMKL because softmax gating would choose composed of five different feature representations (different bags one or very few patches and a patch by itself does not carry of words) with some additional geometry information of the enough discriminative information. images, which is ignored in our experiments due to missing We train SVMs with linear kernels calculated on the whole values. The properties of these feature representations are sum- image in different resolutions. MKL and LMKL combine linear marized in Table 4. The classification task is to predict whether an kernels calculated on each image patch. LMKL uses the whole image is an advertisement or not. We use the CAPTION representa- image with different resolutions in the gating model [14]. tion in the gating model due to its lower dimensionality com- 4 We perform experiments on the OLIVETTI data set, which pared to the other representations. consists of 10 different 64 64 grayscale images of 40 subjects. Table 5 gives the classification results on the ADVERT data set We construct a two-class data set by combining male subjects (36 obtained by SVM, MKL, and LMKL. We see that SVM (ALL)is subjects) into one class versus female subjects (four subjects) in significantly more accurate than the best SVM with single feature another class. Our experimental methodology for this data set is representation, namely SVM (ANCURL), and uses significantly slightly different: We select two images of each subject randomly fewer support vectors. MKL has comparable classification accu- and reserve these total 80 images as the test set. Then, we apply racy to SVM (ALL) and the difference between the number of 8-fold cross-validation on the remaining 320 images by putting support vectors is not significant. LMKL with softmax/sigmoid one image of each subject to the validation set at each fold. MKL gating has comparable accuracy to MKL and SVM (ALL). LMKL with and LMKL combine 16 linear kernels calculated on image patches sigmoid gating stores significantly fewer support vectors than of size 16 16. SVM (ALL). 
The average kernel weights and the average number of Table 7 shows the results of MKL and LMKL combining kernels active kernels on the ADVERT data set are given in Table 6. The calculated over non-overlapping patches of face images. MKL difference between the running times of MKL and LMKL is not as achieves significantly higher classification accuracy than all single significant as on the MULTIFEAT data set because the gating model kernel SVMs except in 32 32 resolution. LMKL with softmax representation (CAPTION) has only 19 dimensions. Different from gating has comparable classification accuracy to MKL and stores the MULTIFEAT data set, LMKL uses approximately the same number significantly fewer support vectors when 4 4or16 16 images

3 Available at http://archive.ics.uci.edu/ml/datasets/InternetþAdvertisements. 4 Available at http://cs.nyu.edu/roweis/data.html. M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807 803

97 80

60 96 40 95 20 test accuracy support vector 94 0 5 101520 5101520 P P 6000

4000

2000 training time

0 5 101520 P

Fig. 4. The average test accuracies, support vector percentages, and training times on the ADVERT data set obtained by LMKL with multiple copies of linear kernels and sigmoid gating on the ANCURL representation.

Table 7 to consider only regions of high saliency [1]. For example, if we use Classification results on the OLIVETTI data set. LMKL with softmax gating (see Fig. 5(d)–(f)), the gating model generally activates a single patch containing a part of eyes or Method Test Support Training accuracy vector time (s) eyebrows depending on the subject. This may not be enough for good discrimination and using sigmoid gating is more appropriate. SVM (x ¼ 4 4) 93.2870.65 21.7070.93 1.6770.19 When we use LMKL with sigmoid gating (see Fig. 5(g)–(i)), multiple SVM (x ¼ 8 8) 97.5071.16 20.1371.04 1.2170.22 patches are given nonzero weights in a data-dependent way. ¼ 7 7 7 SVM (x 16 16) 97.03 0.93 19.91 1.01 1.24 0.12 Fig. 6 gives the average kernel weights on the test set for MKL, SVM (x ¼ 32 32) 97.9771.48 23.7171.39 2.0570.22 SVM (x ¼ 64 64) 97.6671.41 25.9471.01 2.1070.18 LMKL with softmax gating, and LMKL with sigmoid gating. We see that MKL and LMKL with softmax gating use fewer high-

MKL 99.0670.88 22.1971.00 11.5370.82 resolution patches than LMKL with sigmoid gating. We can generalize this idea even further: Let us say that we have a number of information sources that are costly to extract or LMKL (softmax and xG ¼ 4 4) 97.1972.81 16.6573.34 83.13732.09 LMKL (softmax and xG ¼ 8 8) 97.1972.48 22.5474.56 139.73749.78 process, and a relatively simpler one. In such a case, we can feed LMKL (softmax and 99.2271.33 16.3871.50 405.617153.25 the simple representation to the gating model and feed the costly xG ¼ 16 16) representations to the actual kernels and train LMKL. The gating model then chooses a costly representation only when it is LMKL (sigmoid and xG ¼ 4 4) 99.2270.93 22.7271.83 45.3975.62 needed and chooses only a subset of the costly representations. LMKL (sigmoid and xG ¼ 8 8) 99.8470.44 26.8872.24 64.55714.99 Note that the representation used by the gating model does not LMKL (sigmoid and 99.3871.34 21.6571.44 81.03736.59 need to be very precise, because it does not do the actual decision, G x ¼ 16 16) but only chooses the representation(s) that do the actual decision. are used in the gating model. This is mainly due to the normal- 3.2. Regression experiments ization property of softmax gating that generally activates a single patch and ignores the others; this uses fewer support vectors 3.2.1. Illustrative regression problem but is not as accurate. LMKL with sigmoid gating significantly We illustrate the applicability of LMKL to regression problems improves the classification accuracy over MKL by looking at the on the MOTORCYCLE data set discussed in [32]. We train LMKL with 8 8 images in the gating model and choosing a subset of the three linear kernels and softmax gating (C¼1000 and E ¼ 16) high-resolution patches. We see that the training time of LMKL is using 10-fold cross-validation. Fig. 7 shows the average of global monotonically increasing with the dimensionality of the gating and local fits obtained for these 10 folds. We learn a piecewise model representation. linear fit through three local models that are obtained using linear Fig. 5 illustrates the example uses of MKL and LMKL with kernels in each region and we combine them using the softmax softmax and sigmoid gating. Fig. 5(b)–(c) show the combination gating model (shown by dashed lines). The softmax gating model weights found by MKL and sample face images stored as support divides the input space between kernels, generally selects a single vectors weighted with those. MKL uses the same weights over the kernel to use, and also ensures a smooth transition between whole input space and thereby the parts whose weights are local fits. nonzero are used in the decision process for all subjects. When we look at the results of LMKL, we see that the gating model activates important parts of each face image and these parts are 3.2.2. Combining multiple kernels for benchmark data sets used in the classifier with nonzero weights, whereas the parts We compare SVR and LMKL in terms of regression perfor- whose gating model outputs are zero are not considered. That is, mance (i.e., mean square error), model complexity (i.e., stored looking at the output of the gating model, we can skip processing support vector percentage), and training time. We train SVRs with the high-resolution versions of these parts. This can be considered different kernels, namely linear kernel and polynomial kernels up similar to a selective attention mechanism whereby the gating model to fifth degree. 
LMKL combines these five kernels with both defines a saliency measure and drives a high-resolution ‘‘fovea’’/‘‘eye’’ softmax and sigmoid gating models. 804 M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807

m m Fig. 5. Example uses of MKL and LMKL on the OLIVETTI data set. (a) Fmðx Þ: features fed into kernels, (b) Zm: combination weights, and (c) ZmFmðx Þ: features weighted G 9 9 m with combination weights, (d) x : features fed into softmax gating model, (e) Zmðx VÞ: softmax gating model outputs, (f) Zmðx VÞFmðx Þ: features weighted with softmax G 9 9 m gating model outputs, (g) x : features fed into sigmoid gating model, (h) Zmðx VÞ: sigmoid gating model outputs, and (i) Zmðx VÞFmðx Þ: features weighted with sigmoid gating model outputs.

100 η η 2 η 3 1 50

0

kη Fig. 6. Average kernel weights on the OLIVETTI data set. (a) MKL, (b) LMKL with softmax k3 gating on 16 16 resolution, and (c) LMKL with sigmoid gating on 8 8 resolution. −50

We perform experiments on the Concrete Compressive Strength 5 6 k2 (CONCRETE)dataset andtheWineQuality(WHITEWINE) data set −100 from the UCI Machine Learning Repository. E is selected from {1, 2, 4, 8, 16} for the CONCRETE data set and {0.08, 0.16, 0.32, 0.64, 1.28} for k1 the WHITEWINE data set. −150 Table 8 lists the regression results on the CONCRETE data set 0 102030405060 obtained by SVR and LMKL. We see that both LMKL with softmax gating and LMKL with sigmoid gating are significantly more Fig. 7. Global and local fits (solid lines) obtained by LMKL with three linear kernels and softmax gating on the MOTORCYCLE data set. The dashed lines show accurate than all of the single kernel SVRs. LMKL with softmax gating model outputs, which are multiplied by 50 for visual clarity. gating uses kL, kP (q¼4), and kP (q¼5) with relatively higher weights but LMKL with sigmoid gating uses all of the kernels with significant weights (see Table 9). When we combine multiple copies of the linear kernel using the softmax gating model (shown Table 10 lists the regression results on the WHITEWINE data set in Fig. 8), we see that LMKL does not overfit and we get obtained by SVR and LMKL. We see that both LMKL with softmax significantly lower error than the best single kernel SVR (kP and gating and LMKL with sigmoid gating obtain significantly less q¼3). For example, LMKL with five copies of kL and softmax error than SVR (kL), SVR (kP and q¼2), and SVR (kP and q¼3), and gating gets significantly lower error than SVR (kP and q¼3) and have comparable error to SVR (kP and q¼4) and SVR (kP and q¼5) stores significantly fewer support vectors. Similar to the binary but store significantly fewer support vectors than all single kernel classification results, the training time of LMKL is increasing SVRs. Even if we do not decrease the error, we learn computa- linearly with increasing number of kernels. tionally simpler models by storing much fewer support vectors. We see from Table 11 that LMKL with softmax gating assigns

relatively higher weights to kL, kP (q¼3), and kP (q¼5), whereas LMKL with sigmoid gating uses the polynomial kernels nearly 5 Available at http://archive.ics.uci.edu/ml/datasets/ConcreteþCompressiveþ Strength. everywhere in the input space and the linear kernel for some of 6 Available at http://archive.ics.uci.edu/ml/datasets/WineþQuality. the test instances. M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807 805

4. Discussion can be reduced using a hot-start procedure (i.e., starting from the previous solution). The number of iterations before convergence We discuss the key properties of the proposed method and clearly depends on the training data and the step size selection compare it with similar MKL methods in the literature. procedure. The key issue for faster convergence is to select good gradient-descent step sizes at each iteration. The step size of each iteration should be determined with a line search method (e.g., 4.1. Computational complexity Armijo’s rule whose search procedure allows backtracking and does not use any curve fitting method), which requires solving When we are training LMKL, we need to solve a canonical additional kernel machine problems. Clearly, the time complexity kernel machine problem with the combined kernel obtained with for each iteration increases but the algorithm converges in fewer the current gating model parameters and calculate the gradients iterations. In practice, we see convergence in 5–20 iterations. ð Þ of J V at each iteration. The gradients calculations are made One main advantage of LMKL is in reducing the time complex- using the support vectors of the current iteration. The gradient ity for the testing phase as a result of localization. When calculation step has lower time complexity compared to the calculating the locally combined kernel function, kZðxi,xÞ,in(9), kernel machine solver when the gating model representation is m m kmðxi ,x Þ needs to be evaluated or calculated only if both ZmðxiÞ low-dimensional. If we have a high-dimensional gating model and Z ðxÞ are active (i.e., nonzero). representation, we can apply an unsupervised dimensionality m reduction method (e.g., principal component analysis) on this representation in order to decrease the training time. The com- 4.2. Knowledge extraction putational complexity of LMKL also depends on the complexity of the canonical kernel machine solver used in the main loop, which The kernel weights obtained by MKL can be used to extract knowledge about the relative contributions of kernel functions used Table 8 in combination. Different kernels define different similarity mea- Regression results on the CONCRETE data set.

Table 10 Method MSE Support vector Training time (s) Regression results on the WHITEWINE data set. SVR (k ) 120.6172.15 44.3173.46 30.3273.10 L Method MSE Support vector Training time (s) SVR (kP and q¼2) 92.5774.19 36.2271.24 32.8178.18

SVR (kP and q¼3) 58.3273.66 73.1671.21 47.7173.64 SVR (kL) 0.5970.00 66.8370.57 1870.137148.78 SVR (kP and q¼4) 63.8379.58 52.5272.40 34.3775.19 SVR (kP and q¼2) 0.5470.01 66.2270.67 1864.247166.48 SVR (kP and q¼5) 61.2675.31 52.1772.23 34.5573.72 SVR (kP and q¼3) 0.5470.00 66.1471.13 3622.557297.24

SVR (kP and q¼4) 0.5270.01 66.5571.03 4223.707328.03

LMKL (softmax) 44.8076.33 64.2874.02 818.327113.65 SVR (kP and q¼5) 0.5270.01 66.2771.24 2469.317221.74 LMKL (sigmoid) 48.1875.22 49.2072.05 575.32781.75

LMKL (softmax) 0.5270.01 18.66713.41 60632.5176905.62

LMKL (5 kL and softmax) 53.1476.42 34.9378.45 1115.327137.15 LMKL (sigmoid) 0.5170.00 38.2972.34 76249.60712724.65

Table 9 Table 11 Average kernel weights and number of active kernels on the CONCRETE data set. Average kernel weights and number of active kernels on the WHITEWINE data set.

Method kL kP kP kP kP Method kL kP kP kP kP q¼2 q¼3 q¼4 q¼5 q¼2 q¼3 q¼4 q¼5

LMKL (softmax) 0.1495 0.0091 0.0117 0.0951 0.7346 LMKL (softmax) 0.2238 0.0302 0.1430 0.0296 0.5733 LMKL (sigmoid) 0.6675 0.8176 0.9962 0.9721 0.9989 LMKL (sigmoid) 0.5956 0.9698 0.9978 0.9849 0.9929

The average numbers of active kernels are 4.52 and 4.68, respectively. The average numbers of active kernels are 1.05 and 4.58, respectively.

150 100

100 50

test error 50 support vector 0 0 5 101520 5101520 P P 6000

4000

2000 training time

0 5101520 P

Fig. 8. The average test mean square errors, support vector percentages, and training times on the CONCRETE data set obtained by LMKL with multiple copies of linear kernels and softmax gating. 806 M. Gonen,¨ E. Alpaydın / Pattern Recognition 46 (2013) 795–807 sures and we can deduce which similarity measures are appropriate Similar to LMKL, a Bayesian approach is developed for combin- for the task at hand. If kernel functions are evaluated over different ing different feature representations in a data-dependent way feature subsets or feature representations, the important ones under the Gaussian process framework [7]. A common covariance have higher combination weights. With our LMKL framework, we function is obtained by combining the covariances of feature can extract similar information for different regions of the input representations in a nonlinear manner. This formulation can space. This enables us to extract information about kernels (simi- identify the noisy data instances for each feature representation larity measures), feature subsets, and/or feature representations in and prevent them from being used. Classification is performed a data-dependent manner. using the standard Gaussian processes approach with the com- mon covariance function. 4.3. Regularization Inspired from LMKL, two methods that learn a data-dependent kernel function are used for image recognition applications [34,35]; Canonical kernel machines learn sparse models as a result of they differ in their gating models that are constants rather than regularization on the weight vector but the underlying complex- functions of the input. In [34], the training set is divided into ity of the kernel function is the main factor for determining clusters as a preprocessing step and then cluster-specific kernel the model complexity. The main advantage of LMKL in terms of weights are learned using an alternating optimization method. The regularization over canonical kernel machines is the inherent combined kernel function can be written as regularization effect on the gating model. When we regularize XP k ðx ,x Þ¼ Zmk ðxm,xmÞZm the sum of the hyperplane weight vectors in (6), because these Z i j ci m i j cj weight vectors are written in terms of the gating model as in (7), m ¼ 1 we also regularize the gating model as a side effect. MKL where Zm corresponds to the weight of kernel k ð,Þ in the cluster ci m can combine only different kernel functions and more complex xi belongs to. The kernel weights of the cluster which a test instance kernels are favored over the simpler ones in order to get better is assigned to are used in the testing phase. In [35], instance-specific performance. However, LMKL can also combine multiple copies of kernel weights are used instead of cluster-specific weights. The the same kernel and it can dynamically construct a more complex corresponding combined kernel function is locally combined kernel using the kernels in a data-dependent XP way. LMKL eliminates some of the kernels by assigning zero m m m m kZðxi,xjÞ¼ Zi kmðxi ,xj ÞZj weights to the corresponding gating outputs in order to get a m ¼ 1 more regularized solution. Figs. 2–4 and 8 give empirical support where Zm corresponds to the weight of kernel k ð,Þ for x and to this regularization effect, where we see that LMKL does not i m i instance-specific weights are optimized using an alternating opti- overfit even if we increase the number of kernels up to 20. mization problem for the training set. 
But, in the testing phase, the kernel weights for a test instance are all taken to be equal. 4.4.

The localized kernel idea can also be combined with dimension- 5. Conclusions ality reduction. If the training instances have a local structure (i.e., lie on low-dimensional manifolds locally), we can learn low- This work introduces a localized multiple kernel learning dimensional local projections in each region, which we can also framework for kernel-based algorithms. The proposed algorithm use for visualization. Previously, it had been proposed to integrate has two main ingredients: (i) a gating model that assigns weights a projection matrix into the discriminant function [6] and we to kernels for a data instance, (ii) a kernel-based learning algo- extended this idea to project data instances into different feature rithm with the locally combined kernel. The training of these two spaces using local projection matrices combined with a gating components is coupled and the parameters of both components model, and calculate the combined kernel function with the dot are optimized together using a two-step alternating optimization product in the combined feature space [17]. The local projection procedure. We derive the learning algorithm for three different matrices can be learned with the other parameters, as before, using gating models (softmax, sigmoid, and Gaussian) and apply the a two-step alternating optimization algorithm. localized multiple kernel learning framework to four different machine learning problems (two-class classification, regression, 4.5. Related work multiclass classification, and one-class classification). We perform experiments for several two-class classification LMKL finds a nonlinear combination of kernel functions with the and regression problems. We compare the empirical performance help of the gating model. The idea of learning a nonlinear combina- of LMKL with single kernel SVM and SVR as well as MKL. For tion is also discussed in different studies. For example, a latent classification problems defined on different feature representa- variable generative model using the maximum entropy discrimina- tions, LMKL is able to construct better classifiers than MKL by tion to learn data-dependent kernel combination weights is proposed combining the kernels on these representations locally. In our in [23]. This method combines a generative probabilistic model with experiments, LMKL achieves higher average test accuracies and a discriminative large margin method using a log-ratio of Gaussian stores fewer support vectors compared to MKL. If the combined mixtures as the classifier. feature representations are complementary and do not contain In a more recent work, a nonlinear kernel combination method redundant information, the sigmoid gating model should be based on kernel ridge regression and polynomial combination of selected instead of softmax gating, in order to have the possibility kernels is proposed [8] of using more than one representation. We also see that, as X q q expected, combining heterogeneous feature representations is k ðx ,x Þ¼ 1 ... P k ðx1,x1Þq1 ...k ðxP,xPÞqP Z i j Z1 ZP 1 i j P i j more advantageous than combining multiple copies of the same q A Q P representation. For image recognition problems, LMKL identifies A P P where Q ¼fq : q Z þ , m ¼ 1 qm ¼ dg and the kernel weights are the relevant parts of each input image separately using the gating optimized over a positive, bounded, and convex set using a model as a saliency detector on the kernels on the image patches, projection-based gradient-descent algorithm. 
5. Conclusions

This work introduces a localized multiple kernel learning framework for kernel-based algorithms. The proposed algorithm has two main ingredients: (i) a gating model that assigns weights to kernels for a data instance, and (ii) a kernel-based learning algorithm with the locally combined kernel. The training of these two components is coupled, and the parameters of both components are optimized together using a two-step alternating optimization procedure. We derive the learning algorithm for three different gating models (softmax, sigmoid, and Gaussian) and apply the localized multiple kernel learning framework to four different machine learning problems (two-class classification, regression, multiclass classification, and one-class classification).
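The sketch below illustrates one iteration of this coupling, assuming the softmax-gated locally combined kernel of the usual LMKL form k_eta(x_i, x_j) = sum_m eta_m(x_i) k_m(x_i, x_j) eta_m(x_j) [13]; the data, gating parameters, and base kernels are placeholders, and the gating update is only indicated, not derived.

import numpy as np
from sklearn.svm import SVC

def softmax_gating(X, V):
    # eta_m(x) via softmax over linear gating scores (softmax gating model).
    S = X @ V.T
    S -= S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def locally_combined_kernel(K_list, eta):
    # k_eta(x_i, x_j) = sum_m eta_m(x_i) k_m(x_i, x_j) eta_m(x_j).
    return sum(np.outer(eta[:, m], eta[:, m]) * K_m for m, K_m in enumerate(K_list))

# One iteration of the two-step alternating procedure on toy data.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=60))
K_list = [X @ X.T,
          np.exp(-0.5 * np.linalg.norm(X[:, None] - X[None, :], axis=2) ** 2)]
V = rng.normal(size=(len(K_list), 4))          # gating parameters (placeholder initialization)

eta = softmax_gating(X, V)                      # gating model held fixed ...
K_eta = locally_combined_kernel(K_list, eta)
svm = SVC(kernel="precomputed", C=1.0).fit(K_eta, y)   # ... while the first step solves the SVM
# The second step would update V by a gradient step on the objective with the SVM solution fixed,
# and the two steps are repeated until convergence.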

We perform experiments for several two-class classification and regression problems. We compare the empirical performance of LMKL with single-kernel SVM and SVR as well as MKL. For classification problems defined on different feature representations, LMKL is able to construct better classifiers than MKL by combining the kernels on these representations locally. In our experiments, LMKL achieves higher average test accuracies and stores fewer support vectors compared with MKL. If the combined feature representations are complementary and do not contain redundant information, the sigmoid gating model should be selected instead of softmax gating, in order to have the possibility of using more than one representation. We also see that, as expected, combining heterogeneous feature representations is more advantageous than combining multiple copies of the same representation. For image recognition problems, LMKL identifies the relevant parts of each input image separately, using the gating model as a saliency detector on the kernels on the image patches, and we see that LMKL obtains better classification results than MKL for a gender recognition task using face images. For regression problems, LMKL improves the performance by reducing the mean square error significantly on one of the data sets and by storing significantly fewer support vectors on the other data set. Different from MKL methods that use global kernel weights, LMKL can combine multiple copies of the same kernel. We show that even if we provide more kernels than needed, LMKL uses only as many support vectors as required and does not overfit.

Acknowledgments

This work was supported by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program under EA-TÜBA-GEBİP/2001-1-1, the Boğaziçi University Scientific Research Project 07HA101, and the Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant EEEAG 107E222. The work of M. Gönen was supported by the Ph.D. scholarship (2211) from TÜBİTAK.

References

[1] E. Alpaydın, Selective attention for handwritten digit recognition, in: Advances in Neural Information Processing Systems 8, 1996.
[2] E. Alpaydın, Combined 5 × 2 cv F test for comparing supervised classification learning algorithms, Neural Computation 11 (8) (1999) 1885–1892.
[3] E. Alpaydın, M.I. Jordan, Local linear perceptrons for classification, IEEE Transactions on Neural Networks 7 (3) (1996) 788–792.
[4] F.R. Bach, G.R.G. Lanckriet, M.I. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in: Proceedings of the 21st International Conference on Machine Learning, 2004.
[5] K.P. Bennett, M. Momma, M.J. Embrechts, MARK: a boosting algorithm for heterogeneous kernel models, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[6] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Machine Learning 46 (1–3) (2002) 131–159.
[7] M. Christoudias, R. Urtasun, T. Darrell, Bayesian Localized Multiple Kernel Learning, Technical Report UCB/EECS-2009-96, University of California at Berkeley, 2009.
[8] C. Cortes, M. Mohri, A. Rostamizadeh, Learning non-linear combinations of kernels, in: Advances in Neural Information Processing Systems 22, 2010.
[9] K. Crammer, J. Keshet, Y. Singer, Kernel design using boosting, in: Advances in Neural Information Processing Systems 15, 2003.
[10] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, 2000.
[11] M. Girolami, S. Rogers, Hierarchic Bayesian models for kernel learning, in: Proceedings of the 22nd International Conference on Machine Learning, 2005.
[12] M. Girolami, M. Zhong, Data integration for classification problems employing Gaussian process priors, in: Advances in Neural Information Processing Systems 19, 2007.
[13] M. Gönen, E. Alpaydın, Localized multiple kernel learning, in: Proceedings of the 25th International Conference on Machine Learning, 2008.
[14] M. Gönen, E. Alpaydın, Localized multiple kernel learning for image recognition, in: NIPS Workshop on Understanding Multiple Kernel Learning Methods, 2009.
[15] M. Gönen, E. Alpaydın, Multiple kernel machines using localized kernels, in: Supplementary Proceedings of the Fourth IAPR International Conference on Pattern Recognition in Bioinformatics, 2009.
[16] M. Gönen, E. Alpaydın, Localized multiple kernel regression, in: Proceedings of the 20th International Conference on Pattern Recognition, 2010.
[17] M. Gönen, E. Alpaydın, Supervised learning of local projection kernels, Neurocomputing 73 (10–12) (2010) 1694–1703.
[18] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, Journal of Machine Learning Research 12 (2011) 2211–2268.
[19] M. Hu, Y. Chen, J.T.-Y. Kwok, Building sparse multiple-kernel SVM classifiers, IEEE Transactions on Neural Networks 20 (5) (2009) 827–839.
[20] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, G.E. Hinton, Adaptive mixtures of local experts, Neural Computation 3 (1991) 79–87.
[21] J. Kandola, J. Shawe-Taylor, N. Cristianini, Optimizing kernel alignment over combinations of kernels, in: Proceedings of the 19th International Conference on Machine Learning, 2002.
[22] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, M.I. Jordan, Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research 5 (2004) 27–72.
[23] D.P. Lewis, T. Jebara, W.S. Noble, Nonstationary kernel combination, in: Proceedings of the 23rd International Conference on Machine Learning, 2006.
[24] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classification using string kernels, Journal of Machine Learning Research 2 (2002) 419–444.
[25] J.M. Moguerza, A. Muñoz, I.M. de Diego, Improving support vector classification via the combination of multiple sources of information, in: Proceedings of Structural, Syntactic, and Statistical Pattern Recognition, Joint IAPR International Workshops, 2004, pp. 592–600.
[26] Mosek, The MOSEK Optimization Tools Manual Version 6.0 (Revision 119), MOSEK ApS, Denmark, 2011.
[27] P. Pavlidis, J. Weston, J. Cai, W.N. Grundy, Gene functional classification from heterogeneous data, in: Proceedings of the Fifth Annual International Conference on Computational Molecular Biology, 2001, pp. 242–248.
[28] I. Psorakis, T. Damoulas, M.A. Girolami, Multiclass relevance vector machines: sparsity and accuracy, IEEE Transactions on Neural Networks 21 (10) (2010) 1588–1598.
[29] A. Rakotomamonjy, F.R. Bach, S. Canu, Y. Grandvalet, SimpleMKL, Journal of Machine Learning Research 9 (2008) 2491–2521.
[30] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, Cambridge, MA, 2002.
[31] B. Schölkopf, K. Tsuda, J.P. Vert (Eds.), Kernel Methods in Computational Biology, The MIT Press, Cambridge, MA, 2004.
[32] B.W. Silverman, Some aspects of the spline smoothing approach to non-parametric regression curve fitting, Journal of the Royal Statistical Society: Series B 47 (1985) 1–52.
[33] V.N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[34] J. Yang, Y. Li, Y. Tian, L. Duan, W. Gao, Group-sensitive multiple kernel learning for object categorization, in: Proceedings of the 12th IEEE International Conference on Computer Vision, 2009.
[35] J. Yang, Y. Li, Y. Tian, L. Duan, W. Gao, Per-sample multiple kernel approach for visual concept learning, EURASIP Journal on Image and Video Processing (2010), http://dx.doi.org/10.1155/2010/461450.
[36] Y. Ying, K. Huang, C. Campbell, Enhanced protein fold recognition through a novel data integration approach, BMC Bioinformatics 10 (1) (2009) 267.

Mehmet Gönen received the B.Sc. degree in industrial engineering, and the M.Sc. and Ph.D. degrees in computer engineering, from Boğaziçi University, İstanbul, Turkey, in 2003, 2005, and 2010, respectively. He was a Teaching Assistant at the Department of Computer Engineering, Boğaziçi University. He is currently doing his postdoctoral work at the Department of Information and Computer Science, Aalto University School of Science, Espoo, Finland. His research interests include support vector machines, kernel methods, Bayesian methods, optimization for machine learning, dimensionality reduction, information retrieval, and computational biology applications.

Ethem Alpaydın received his B.Sc. from the Department of Computer Engineering of Boğaziçi University in 1987 and the degree of Docteur ès Sciences from École Polytechnique Fédérale de Lausanne in 1990. He did his postdoctoral work at the International Computer Science Institute, Berkeley, in 1991 and afterwards was appointed Assistant Professor at the Department of Computer Engineering of Boğaziçi University. He was promoted to Associate Professor in 1996 and to Professor in 2002 in the same department. As a visiting researcher, he worked at the Department of Brain and Cognitive Sciences of MIT in 1994, the International Computer Science Institute, Berkeley, in 1997, and IDIAP, Switzerland, in 1998. He was awarded a Fulbright Senior Scholarship in 1997 and received the Research Excellence Award from the Boğaziçi University Foundation in 1998 (junior level) and in 2008 (senior level), the Young Scientist Award from the Turkish Academy of Sciences in 2001, and the Scientific Encouragement Award from the Scientific and Technological Research Council of Turkey in 2002. His book Introduction to Machine Learning was published by The MIT Press in October 2004; its German edition was published in 2008, its Chinese edition in 2009, its second edition in 2010, and its Turkish edition in 2011. He is a senior member of the IEEE, an editorial board member of The Computer Journal (Oxford University Press), and an associate editor of Pattern Recognition (Elsevier).