Multiclass Multiple Kernel Learning
Alexander Zien, Cheng Soon Ong
Max Planck Inst. for Biol. Cybernetics and Friedrich Miescher Lab., Spemannstr. 39, Tübingen, Germany.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Abstract

In many applications it is desirable to learn from several kernels. “Multiple kernel learning” (MKL) allows the practitioner to optimize over linear combinations of kernels. By enforcing sparse coefficients, it also generalizes feature selection to kernel selection. We propose MKL for joint feature maps. This provides a convenient and principled way for MKL with multiclass problems. In addition, we can exploit the joint feature map to learn kernels on output spaces. We show the equivalence of several different primal formulations including different regularizers. We present several optimization methods, and compare a convex quadratically constrained quadratic program (QCQP) and two semi-infinite linear programs (SILPs) on toy data, showing that the SILPs are faster than the QCQP. We then demonstrate the utility of our method by applying the SILP to three real world datasets.

1. Introduction

In support vector machines (SVMs), a kernel function $k$ implicitly maps examples $x$ to a feature space given by a feature map $\Phi$ via the identity $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$ (e.g. [19]). It is often unclear what the most suitable kernel for the task at hand is, and hence the user may wish to combine several possible kernels. One problem with simply adding kernels is that using uniform weights is possibly not optimal. An extreme example is the case that one kernel is not correlated with the labels at all; then giving it positive weight just adds noise [13]. Multiple kernel learning (MKL) is a way of optimizing kernel weights while training the SVM. In addition to leading to good classification accuracies, MKL can also be useful for identifying relevant and meaningful features [2, 13, 20].

Since in many real world applications more than two classes are to be distinguished, there has been a lot of work on decomposing multiclass problems into several standard binary classification problems, or developing genuine multiclass classifiers. The latter has so far been restricted to single kernels. In this paper we develop and investigate a multiclass extension of MKL. We provide:

• An intuitive formulation of the multiclass MKL task, with a new sparsity-promoting regularizer.
• A proof of equivalence with (the multiclass extension of) a previous MKL formulation.
• Several optimization approaches, with a computational comparison between three of them.
• Experimental results on several datasets demonstrating the method’s utility.

2. Multiclass Multiple Kernel Learning

A common approach (e.g. [21]) to multiclass classification is the use of joint feature maps $\Phi(x, y)$ on data $\mathcal{X}$ and labels $\mathcal{Y} = \{1, \ldots, m\}$, $m > 2$, together with a linear output function

$$f_{w,b}(x, y) = \langle w, \Phi(x, y) \rangle + b_y, \tag{1}$$

parameterized with the hyperplane normal $w$ and biases $b$. Here, $b$ is short for the stacked variables $(b_1, \ldots, b_m)$; we will use analogous notation throughout the paper. The predicted class $y$ for a point $x$ is chosen to maximize the output,

$$x \mapsto \arg\max_{y \in \mathcal{Y}} f_{w,b}(x, y).$$

Hence for a suitable convex loss function $\ell$, training can be implemented by the following convex multiclass support vector machine (m-SVM) optimization problem (OP):

$$\min_{w,b} \; \frac{1}{2} \|w\|^2 + \sum_{i=1}^{n} \max_{u \neq y_i} \left\{ \ell\big( f_{w,b}(x_i, y_i) - f_{w,b}(x_i, u) \big) \right\}. \tag{2}$$

Here $n$ is the size of the training set, and $u \neq y_i$ is short for $u \in \mathcal{Y} - \{y_i\}$. For the hinge loss, $\ell(t) := C \max(0, 1 - t)$, the dual of this is a well-known quadratic program (QP) [21].

2.1. Multiple Kernel Learning (MKL) Primal

We can generalize the above m-SVM further to operate on $p \ge 1$ feature maps $\Phi_k(x_i, y_i)$ (for $k = 1, \ldots, p$). For each feature map there will be a separate weight vector $w_k$. Here we consider linear combinations of the corresponding output functions:

$$f_{w,b,\beta}(x, y) = \sum_{k=1}^{p} \beta_k \langle w_k, \Phi_k(x, y) \rangle + b_y.$$

This corresponds to direct sums of feature spaces of the form $\tilde{\Phi}(x_i, y_i) = (\beta_k \Phi_k(x_i, y_i))_{k=1,\ldots,p}$. The mixing coefficients $\beta$ should reflect the utility of the respective feature map for the classification task.

We aim at choosing $w = (w_k)_{k=1,\ldots,p}$ and $\beta$ such that $f_{w,b,\beta}(x_i, y_i) \ge f_{w,b,\beta}(x_i, u)$ for all $u \in \mathcal{Y} - \{y_i\}$. The resulting OP, a generalization of (2), can be written as

$$\begin{aligned}
\min_{\beta, w, b, \xi} \quad & \frac{1}{2} \sum_{k=1}^{p} \beta_k \|w_k\|^2 + \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & \forall i : \; \xi_i = \max_{u \neq y_i} \ell\big( f_{w,b,\beta}(x_i, y_i) - f_{w,b,\beta}(x_i, u) \big)
\end{aligned} \tag{3}$$

where we regularize the $p$ output functions according to their weights $\beta_k$. In this paper the weights are understood to be on the standard simplex, i.e.

$$\beta \in \Delta_p := \left\{ \beta \;\middle|\; \sum_{k=1}^{p} \beta_k = 1, \; \forall k : 0 \le \beta_k \right\},$$

giving them the flavor of probabilities. This $L_1$ regularizer on $\beta$ promotes sparsity, and hence we are trying to select a subset of kernels.

While the OP (3) is interpretable and intuitive, it has the disadvantage of in general not being convex due to the products of $\beta_k$ and $w_k$ in the output function. We thus apply a change of variables transformation (cf. [4, Section 4.1.3]) with $v_k := \beta_k w_k$. For a convex loss, the resulting OP (below) is convex.

$$\begin{aligned}
\inf_{\beta, v, b, \xi} \quad & \frac{1}{2} \sum_{k=1}^{p} \|v_k\|^2 / \beta_k + \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & \forall i : \; \xi_i = \max_{u \neq y_i} \ell\big( \langle v, \Psi_{iu} \rangle + b_{y_i} - b_u \big)
\end{aligned} \tag{4}$$

where we define $\Psi_{kiu} = \Phi_k(x_i, y_i) - \Phi_k(x_i, u)$ and $\Psi_{iu} = (\Psi_{kiu})_{k=1,\ldots,p}$.

Beyond easing the interpretation of the weights, the constraints on $\beta$ are essential for technical reasons. Their non-negativity guarantees that the combined regularizer is convex, and the resulting kernel is positive semidefinite. Further, without limiting the norm of $\beta$, the regularizer on $w$ would not be effective: it could be driven to zero without changing $f_{w,b,\beta}$ by dividing $w$ by any positive scalar while multiplying $\beta$ with the same number.

2.2. MKL Dual (General and Hinge Loss)

Key to the findings and algorithms in this paper is the dual of the MKL optimization problem. Using a more general proof technique than [20] (cf. appendix), we can find the dual of (4) for any convex loss function $\ell$ without having to require differentiability.

Theorem 1 Let $\ell^*$ be the conjugate function of the given convex loss function $\ell$, and $\delta_{ab}$ be the indicator function of $a = b$. The dual of the convex MKL problem (4) is equivalent to

$$\begin{aligned}
\inf_{\eta, \tilde{\alpha}, \gamma} \quad & \gamma + \sum_{i} \sum_{u \neq y_i} \eta_{iu} \, \ell^*\!\left( -\frac{\tilde{\alpha}_{iu}}{\eta_{iu}} \right) \\
\text{s.t.} \quad & \forall k : \; \gamma \ge \frac{1}{2} \sum_{i,j} \sum_{u \neq y_i} \sum_{v \neq y_j} \tilde{\alpha}_{iu} \tilde{\alpha}_{jv} \langle \Psi_{kiu}, \Psi_{kjv} \rangle, \\
& \forall i : \forall u \neq y_i : \; 0 \le \eta_{iu}, \quad \text{and} \quad \forall i : \; 1 = \sum_{u \neq y_i} \eta_{iu}, \\
& \forall v : \; 0 = \sum_{i} (1 - \delta_{y_i v}) \tilde{\alpha}_{iv} - \sum_{i} \delta_{y_i v} \sum_{u \neq y_i} \tilde{\alpha}_{iu}
\end{aligned} \tag{5}$$

with $\gamma \in \mathbb{R}$, $\tilde{\alpha} \in \mathbb{R}^{n \times (m-1)}$, $\eta \in \mathbb{R}^{n \times (m-1)}$. Observe that the weight vector of the primal is obtained by

$$w_k = w_k(\tilde{\alpha}) := \sum_{i} \sum_{u \neq y_i} \tilde{\alpha}_{iu} \Psi_{kiu}. \tag{6}$$

Now we plug the standard SVM loss function, the hinge loss $\ell(t) := C \max(0, 1 - t)$, into (5). We obtain:

$$\begin{aligned}
\min_{\tilde{\alpha}, \gamma} \quad & \gamma - \sum_{i} \sum_{u \neq y_i} \tilde{\alpha}_{iu} \\
\text{s.t.} \quad & \forall k : \; \gamma \ge \frac{1}{2} \|w_k(\tilde{\alpha})\|^2 \\
& \forall i : \; \sum_{u \neq y_i} \tilde{\alpha}_{iu} \le C, \qquad \forall i : \forall u \neq y_i : \; 0 \le \tilde{\alpha}_{iu} \\
& \forall v : \; 0 = \sum_{i} (1 - \delta_{y_i v}) \tilde{\alpha}_{iv} - \sum_{i} \delta_{y_i v} \sum_{u \neq y_i} \tilde{\alpha}_{iu}
\end{aligned} \tag{7}$$

By introducing $\alpha \in \mathbb{R}^{n \times m}$ via the substitution

$$\alpha_{iu} = \begin{cases} -\tilde{\alpha}_{iu} & \text{if } u \neq y_i \\ \sum_{v \neq y_i} \tilde{\alpha}_{iv} & \text{if } u = y_i \end{cases} \tag{8}$$

we can equivalently rewrite equation (7) into

$$\begin{aligned}
\min_{\alpha, \gamma} \quad & \gamma - \sum_{i} \alpha_{i y_i} \\
\text{s.t.} \quad & \forall k : \; \gamma \ge \frac{1}{2} \|w_k(\alpha)\|^2 \\
& \alpha \in S := \left\{ \alpha \;\middle|\; \begin{array}{l} \forall i : \; 0 \le \alpha_{i y_i} \le C \\ \forall i : \forall u \neq y_i : \; \alpha_{iu} \le 0 \\ \forall i : \; \sum_{u \in \mathcal{Y}} \alpha_{iu} = 0 \\ \forall u \in \mathcal{Y} : \; \sum_{i} \alpha_{iu} = 0 \end{array} \right\}
\end{aligned} \tag{9}$$

with the correspondingly transformed, “unfolded”, expansion for the hyperplane normal,

$$w_k = \sum_{i} \sum_{u \in \mathcal{Y}} \alpha_{iu} \Phi_k(x_i, u). \tag{10}$$

Both versions of the dual, the compact (7) and the unfolded (9), are quadratically constrained quadratic programs (QCQPs).

[...] chain of equivalences, all shown OPs are equivalent, despite their different regularizers.

2.4. Optimization

Recently unconstrained primal optimization is emerging as a promising machine learning technique. For MKL this approach might be most convenient with the OP from [1] (cf. Figure 1), as it does not include $\beta$ and the associated constraints. However, here we follow the traditional MKL route along the dual.

One possibility is to solve either version of the dual directly using an off-the-shelf solver. Equations (7) and (9) are QCQPs. In general, a QCQP can be solved much more efficiently than an SDP with interior point methods due to the added structure of the problem [4]. In Section 3.1 we investigate this approach utilizing the commercial optimization package CPLEX that offers
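
As a small numerical illustration of the substitution (8), the following sketch checks that the compact expansion (6) and the unfolded expansion (10) give the same weight vector, that each row of the unfolded variables sums to zero, and that the two dual objectives agree. It is not the authors' code. Several things here are assumptions made only for illustration: a single feature map, a block one-hot joint feature map `joint_feature` (hypothetical), classes indexed from 0 instead of 1, a dictionary representation of the compact variables, and random non-negative values that are not a solution of the dual (the column-sum constraint of S is not enforced, since the identity between (6) and (10) does not need it).

```python
import random


def joint_feature(x, y, m):
    """Hypothetical joint feature map: place x in the block belonging to class y."""
    phi = [0.0] * (len(x) * m)
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi


def unfold(alpha_tilde, labels, m):
    """Substitution (8). alpha_tilde[i] is a dict {u: value} over all u != y_i."""
    alpha = []
    for a_i, y_i in zip(alpha_tilde, labels):
        row = [0.0] * m
        for u, val in a_i.items():
            row[u] = -val              # case u != y_i
        row[y_i] = sum(a_i.values())   # case u == y_i
        alpha.append(row)
    return alpha


def w_compact(alpha_tilde, X, labels, m):
    """Expansion (6): sum_i sum_{u != y_i} alpha_tilde_iu * (Phi(x_i, y_i) - Phi(x_i, u))."""
    w = [0.0] * (len(X[0]) * m)
    for a_i, x_i, y_i in zip(alpha_tilde, X, labels):
        phi_true = joint_feature(x_i, y_i, m)
        for u, val in a_i.items():
            phi_u = joint_feature(x_i, u, m)
            for d in range(len(w)):
                w[d] += val * (phi_true[d] - phi_u[d])
    return w


def w_unfolded(alpha, X, m):
    """Expansion (10): sum_i sum_{u in Y} alpha_iu * Phi(x_i, u)."""
    w = [0.0] * (len(X[0]) * m)
    for row, x_i in zip(alpha, X):
        for u, val in enumerate(row):
            phi_u = joint_feature(x_i, u, m)
            for d in range(len(w)):
                w[d] += val * phi_u[d]
    return w


# Toy data: m classes (0-based), n points, dim input features.
random.seed(0)
m, n, dim = 3, 4, 2
X = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
labels = [random.randrange(m) for _ in range(n)]
alpha_tilde = [{u: random.random() for u in range(m) if u != y} for y in labels]

alpha = unfold(alpha_tilde, labels, m)
w6 = w_compact(alpha_tilde, X, labels, m)
w10 = w_unfolded(alpha, X, m)

# Largest coordinate-wise difference between the two expansions.
max_gap = max(abs(a - b) for a, b in zip(w6, w10))
# Row sums of alpha, which the third constraint of S requires to vanish.
row_sums = [sum(row) for row in alpha]
# Linear terms of the objectives in (7) and (9).
obj_compact = sum(sum(a_i.values()) for a_i in alpha_tilde)
obj_unfolded = sum(row[y_i] for row, y_i in zip(alpha, labels))
```

With these definitions, `max_gap` and the entries of `row_sums` should be zero up to floating-point rounding, and `obj_compact` should match `obj_unfolded`. With several feature maps the same check would apply to each $w_k$ separately.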