Learning output kernels with block coordinate descent

Francesco Dinuzzo [email protected]
Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany

Cheng Soon Ong [email protected]
Department of Computer Science, ETH Zürich, 8092 Zürich, Switzerland

Peter Gehler [email protected]
Max Planck Institute for Informatics, 66123 Saarbrücken, Germany

Gianluigi Pillonetto [email protected]
Department of Information Engineering, University of Padova, 35131 Padova, Italy

Abstract

We propose a method to learn simultaneously a vector-valued function and a kernel between its components. The obtained kernel can be used both to improve learning performance and to reveal structures in the output space which may be important in their own right. Our method is based on the solution of a suitable regularization problem over a reproducing kernel Hilbert space of vector-valued functions. Although the regularized risk functional is non-convex, we show that it is invex, implying that all local minimizers are global minimizers. We derive a block-wise coordinate descent method that efficiently exploits the structure of the objective functional. Then, we empirically demonstrate that the proposed method can improve classification accuracy. Finally, we provide a visual interpretation of the learned kernel matrix for some well known datasets.

1. Introduction

Classical learning tasks such as binary classification and regression can be solved by synthesizing a real-valued function from the data. More generally, problems such as multiple output regression, multitask learning, multiclass and multilabel classification can be addressed by first learning a vector-valued function, and then applying a suitable transformation to the outputs. In this paper, we introduce a method that simultaneously learns a vector-valued function and the kernel between the components of the output vector. The problem is formulated within the framework of regularization in RKH spaces. We assume that the matrix-valued kernel can be decomposed as the product of a scalar kernel and a positive semidefinite kernel matrix that represents the similarity between the output components, a structure that has been considered in a variety of works (Evgeniou et al., 2005; Caponnetto et al., 2008; Baldassarre et al., 2010).

In practice, an important issue is the choice of the kernel matrix between the outputs. In some cases, one may be able to fix it by using prior knowledge about the relationship between the components. However, such prior knowledge is not available in the vast majority of the applications, therefore it is important to develop data-driven tools to choose the kernel automatically. The learned kernel matrix can also be used for visualization purposes, revealing interesting structures in the output space. For instance, when applying the method to a multiclass classification problem, the learned kernel matrix can be used to cluster the classes into homogeneous groups. Previous works, such as (Zien & Ong, 2007; Lampert & Blaschko, 2008), have shown that learning convex combinations of kernels on the output space can improve performance.

Our method searches over the whole cone of positive semidefinite matrices. Although the resulting optimization problem is non-convex, we prove that the objective is an invex function (Mishra & Giorgi, 2008), namely it has the property that stationary points are global minimizers. Then, we propose an efficient block-wise coordinate descent algorithm to compute an optimal solution.
The algorithm alternates between the solution of a so-called discrete time Sylvester equation and a quadratic optimization problem over the cone of positive semidefinite matrices. Exploiting the specific structure of the problem, we propose an efficient solution for both steps.

The four contributions of this paper are as follows. First, we introduce a new problem of learning simultaneously a vector-valued function and the similarity between its components. Second, we show that our non-convex objective is in fact invex, and hence local minimizers are global minimizers. Third, we derive an efficient block-wise coordinate descent algorithm to solve our problem. Finally, we show empirical evidence that our method can improve performances, and also discover meaningful structures in the output space.

2. Spaces of vector-valued functions

We review some facts used to model learning tasks such as multiple output regression, multiclass and multilabel classification. The key idea is to learn functions with values in a vector space.

2.1. Reproducing Kernel Hilbert Spaces

We provide some background here on reproducing kernel Hilbert spaces (RKHSs) of vector-valued functions, i.e. where the outputs are vectors instead of scalars. Let $\mathcal{Y}$ denote a Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{Y}}$, and $\mathcal{L}(\mathcal{Y})$ the space of bounded linear operators from $\mathcal{Y}$ into itself. The following definitions introduce the positive semidefinite $\mathcal{Y}$-kernel and the RKHS of $\mathcal{Y}$-valued functions, which extend the classical concepts of positive semidefinite kernel and RKHS. See Micchelli & Pontil (2005); Carmeli et al. (2006) for further details.

Definition 2.1 (Positive semidefinite $\mathcal{Y}$-kernel). Let $\mathcal{X}$ denote a non-empty set and $\mathcal{Y}$ a Hilbert space. A symmetric function $H : \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$ is called a positive semidefinite $\mathcal{Y}$-kernel on $\mathcal{X}$ if, for any finite integer $\ell$, the following holds:

$$\sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \langle y_i, H(x_i, x_j) y_j \rangle_{\mathcal{Y}} \geq 0, \qquad \forall (x_i, y_i) \in (\mathcal{X}, \mathcal{Y}).$$

We now introduce Hilbert spaces of vector-valued functions that can be associated to positive semidefinite $\mathcal{Y}$-kernels.

Definition 2.2 (RKHS of $\mathcal{Y}$-valued functions). A Reproducing Kernel Hilbert Space (RKHS) of $\mathcal{Y}$-valued functions $g : \mathcal{X} \to \mathcal{Y}$ is a Hilbert space $\mathcal{H}$ such that, for all $x \in \mathcal{X}$, there exists $C_x \in \mathbb{R}$ such that

$$\|g(x)\|_{\mathcal{Y}} \leq C_x \|g\|_{\mathcal{H}}, \qquad \forall g \in \mathcal{H}.$$

It turns out that every RKHS of $\mathcal{Y}$-valued functions $\mathcal{H}$ can be associated with a unique positive semidefinite $\mathcal{Y}$-kernel $H$, called the reproducing kernel. Conversely, given a positive semidefinite $\mathcal{Y}$-kernel $H$ on $\mathcal{X}$, there exists a unique RKHS of $\mathcal{Y}$-valued functions defined over $\mathcal{X}$ whose reproducing kernel is $H$. The standard definition of positive semidefinite (scalar) kernel and RKHS (of real-valued functions) can be recovered by letting $\mathcal{Y} = \mathbb{R}$.

For the rest of the paper, we assume $\mathcal{Y} = \mathbb{R}^m$. In such a case, $\mathcal{L}(\mathcal{Y})$ is just the space of square matrices of order $m$. By fixing a basis $\{b_i\}_{i \in \mathcal{T}}$ for the output space, where $\mathcal{T} = \{1, \ldots, m\}$, one can uniquely define an associated (scalar-valued) kernel $R$ over $\mathcal{X} \times \mathcal{T}$ such that

$$\langle b_i, H(x_1, x_2) b_j \rangle_{\mathcal{Y}} = R((x_1, i), (x_2, j)),$$

so that a $\mathcal{Y}$-kernel can be seen as a function that maps two inputs into the space of square matrices of order $m$. Similarly, by fixing any function $g : \mathcal{X} \to \mathcal{Y}$, one can uniquely define an associated function $h : \mathcal{X} \times \mathcal{T} \to \mathbb{R}$ such that

$$g(x) = \sum_{i \in \mathcal{T}} h(x, i)\, b_i.$$

Consequently, a space of $\mathcal{Y}$-valued functions over $\mathcal{X}$ is isomorphic to a standard space of scalar-valued functions defined over the input set $\mathcal{X} \times \mathcal{T}$. In conclusion, by fixing a basis for the output space, the problem of learning a vector-valued function can be equivalently regarded as a problem of learning a scalar function defined over an enlarged input set, a modeling approach that is also used in the structured output learning setting (Bakir et al., 2007).
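To make this correspondence concrete, here is a small sketch (all names are illustrative; the separable form $H(x_1, x_2) = K(x_1, x_2) \cdot L$ used below anticipates Section 3 and is only one possible choice of matrix-valued kernel):

```python
import numpy as np

def matrix_kernel(x1, x2, scalar_k, L):
    """Separable matrix-valued kernel H(x1, x2) = K(x1, x2) * L."""
    return scalar_k(x1, x2) * L

def enlarged_input_kernel(x1, i, x2, j, scalar_k, L):
    """Scalar kernel R((x1, i), (x2, j)) = <b_i, H(x1, x2) b_j> in the canonical basis."""
    return matrix_kernel(x1, x2, scalar_k, L)[i, j]

# Example: Gaussian scalar kernel and a 3-output kernel matrix L.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
L = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
x1, x2 = np.array([0.0, 1.0]), np.array([1.0, 0.5])
print(enlarged_input_kernel(x1, 0, x2, 1, rbf, L))  # R at input pair (x1, x2), outputs i=0, j=1
```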
2.2. Learning Tasks

As mentioned in the introduction, many learning tasks can be solved by first learning a function taking values in $\mathbb{R}^m$. We list here brief descriptions of some commonly used models. First of all, in multiple output regression, each component of the vector directly corresponds to an output. For multiclass classification, output data are modeled as binary vectors, with +1 in the position corresponding to the class. By interpreting the components of the learned vector-valued function $g$ as "confidence scores" for the different classes, one can consider a classification rule of the form $\hat{y}(x) = \arg\max_{i \in \mathcal{T}} g_i(x)$. For binary multilabel classification, outputs are vectors with $\pm 1$ at the different components. The final classification rule is given by the component-wise application of a sign function: $\hat{y}_i(x) = \mathrm{sign}(g_i(x))$.
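These two decision rules are easy to state in code; a minimal sketch with illustrative names:

```python
import numpy as np

def decode_multiclass(scores):
    """Multiclass rule: predict the index of the largest confidence score."""
    return int(np.argmax(scores))

def decode_multilabel(scores):
    """Multilabel rule: component-wise sign of the confidence scores (+1 / -1)."""
    return np.where(np.asarray(scores) >= 0.0, 1, -1)

print(decode_multiclass([0.1, 0.7, -0.3]))   # -> 1
print(decode_multilabel([0.1, -0.7, 0.0]))   # -> [ 1 -1  1]
```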

3. Learning an output kernel

In this section, we introduce and study an optimization problem that can be used to learn simultaneously a vector-valued function and a kernel on the outputs.

3.1. Matrix notation

Let $\mathcal{S}^m_{++} \subseteq \mathcal{S}^m_{+} \subseteq \mathcal{M}^m$ denote the open cone of positive definite matrices, the closed cone of positive semidefinite matrices, and the space of square matrices of order $m$, respectively. For any $A, B \in \mathcal{M}^m$, denote the Frobenius inner product $\langle A, B \rangle_F := \mathrm{tr}(A^T B)$, and the induced norm $\|A\|_F := \sqrt{\langle A, A \rangle_F}$. We denote the identity matrix as $I$ and, for any matrix $A$, let $A^T$ denote its transpose, and $A^{\dagger}$ its Moore-Penrose pseudo-inverse. The Kronecker product is denoted by $\otimes$, and the vectorization operator by $\mathrm{vec}(\cdot)$. In particular, for matrices $A$, $B$, $C$ of suitable size, we use the identity

$$\mathrm{tr}(A^T B A C) = \mathrm{vec}(A)^T (C^T \otimes B)\, \mathrm{vec}(A).$$

Figure 1. The objective function for scalar K and L. Although the level sets are non-convex, stationary points are also global minimizers.
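This identity is easy to verify numerically; a small NumPy sanity check (illustrative, using column-major reshaping for $\mathrm{vec}(\cdot)$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, n))

vecA = A.reshape(-1, order="F")          # column-major vectorization vec(A)
lhs = np.trace(A.T @ B @ A @ C)
rhs = vecA @ np.kron(C.T, B) @ vecA      # vec(BAC) = (C^T kron B) vec(A)
assert np.isclose(lhs, rhs)
```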

3.2. Regularized risk

In the following, we assume $\mathcal{Y} = \mathbb{R}^m$. Let $\mathcal{H}$ denote the RKHS of $\mathcal{Y}$-valued functions $g : \mathcal{X} \to \mathcal{Y}$ associated with the kernel $H$ defined as

$$H = K \cdot L,$$

where $K$ is a positive semidefinite scalar kernel on $\mathcal{X}$ that measures the similarity between inputs, and $L \in \mathcal{S}^m_+$ is a symmetric positive semidefinite matrix that encodes the relationships between the output's components.

Our method simultaneously learns a function $g \in \mathcal{H}$ and the matrix $L$ (output kernel) from a set of $\ell$ input-output data pairs $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$. We adopt a regularization approach where a penalty on the function $g$ and on the output kernel matrix $L$ is used to suitably constrain the solution and make the problem well posed:

$$\min_{L \in \mathcal{S}^m_+} \min_{g \in \mathcal{H}} \left[ \sum_{i=1}^{\ell} \frac{\|g(x_i) - y_i\|_2^2}{2\lambda} + \frac{\|g\|_{\mathcal{H}}^2}{2} + \frac{\|L\|_F^2}{2} \right]. \quad (1)$$

Note that the objective functional contains only one regularization parameter. A second regularization parameter is not needed as it can be shown to be equivalent to rescaling $K$. According to the representer theorem for vector-valued functions (Micchelli & Pontil, 2005), the inner minimization problem admits an optimal solution of the form

$$g^*(x) = \sum_{i=1}^{\ell} H(x, x_i)\, c_i = L \sum_{i=1}^{\ell} c_i K(x, x_i), \quad (2)$$

where $c_i \in \mathcal{Y}$ ($i = 1, \ldots, \ell$) are suitable vectors. Let $K \in \mathcal{S}^{\ell}_+$ be such that $K_{ij} = K(x_i, x_j)$, and $Y, C \in \mathbb{R}^{\ell \times m}$ such that

$$Y = (y_1, \ldots, y_{\ell})^T, \qquad C = (c_1, \ldots, c_{\ell})^T.$$

By plugging the expression (2) of $g$ into the objective functional of problem (1), we obtain

$$\min_{L \in \mathcal{S}^m_+} \min_{C \in \mathbb{R}^{\ell \times m}} Q(L, C),$$

where

$$Q(L, C) := \frac{\|Y - KCL\|_F^2}{2\lambda} + \frac{\langle C^T K C, L \rangle_F}{2} + \frac{\|L\|_F^2}{2}. \quad (3)$$

Observe that $Q$ is strongly convex with respect to $L$ and convex with respect to $C$. However, $Q$ is not (quasi-)convex with respect to the pair $(L, C)$. This can be seen from Figure 1, by noticing that the level sets of the objective functional in the two-dimensional case ($m = \ell = 1$) are non-convex. Nevertheless, we will show in Theorem 3.3 that the objective has the property that local minimizers are global minimizers. Furthermore, we derive an efficient optimization method in Section 4.
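For concreteness, the objective (3) can be evaluated directly; a minimal NumPy helper (the function name and argument conventions are illustrative):

```python
import numpy as np

def objective_Q(K, Y, C, L, lam):
    """Evaluate Q(L, C) from equation (3).

    K   : (l, l) kernel matrix on the inputs
    Y   : (l, m) output matrix, rows y_i^T
    C   : (l, m) coefficient matrix, rows c_i^T
    L   : (m, m) output kernel matrix (symmetric PSD)
    lam : regularization parameter lambda
    """
    R = Y - K @ C @ L                            # residual of the data-fit term
    fit = np.linalg.norm(R, "fro") ** 2 / (2.0 * lam)
    reg_g = np.trace(C.T @ K @ C @ L) / 2.0      # <C^T K C, L>_F / 2
    reg_L = np.linalg.norm(L, "fro") ** 2 / 2.0
    return fit + reg_g + reg_L
```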

3.3. Invex functions

Recall that a function $f$ defined over a convex set is called quasiconvex if all its sublevel sets are convex. Although the function $Q$ defined in (3) is not quasi-convex (and hence also not convex), it turns out that it is a so-called invex function, so that stationary points are also global minimizers. We now review some basic facts about invex functions. For more details, we refer to (Mishra & Giorgi, 2008).

Definition 3.1 (Invex function). Let $A$ denote an open set. A differentiable function $f : A \to \mathbb{R}$ is called invex if there exists a function $\eta : A \times A \to \mathbb{R}^n$ such that

$$f(x_1) - f(x_2) \geq \eta(x_1, x_2)^T \nabla f(x_2), \qquad \forall x_1, x_2 \in A.$$

The definition of an invex function is a generalization of the gradient inequality for differentiable convex functions, which corresponds to $\eta(x_1, x_2) = x_1 - x_2$. Remarkably, invex functions can be characterized as follows (Craven & Glover, 1985; Ben-Israel & Mond, 1986).

Theorem 3.2. A differentiable function $f : A \to \mathbb{R}$ is invex if and only if every stationary point is a global minimizer.

While differentiable convex functions are particular instances of invex functions, the class of invex functions is much larger. The class of quasi-convex functions partially overlaps with the class of invex functions. For instance, the function $f_{QC}(x) = x^3$ is quasi-convex but not invex, while the function $f_I(x_1, x_2) = 1 + x_1^2 - e^{-x_2^2}$ is invex but not quasi-convex. For these and other relationships between invex functions and other classes of functions, see (Mishra & Giorgi, 2008).

3.4. The objective is invex

With the definition of invex functions above, we can now state the following result.

Theorem 3.3. Functional $Q$ defined as in (3) is an invex function over the open set $\mathcal{S}^m_{++} \times \mathbb{R}^{\ell \times m}$. In particular, local minimizers are also global minimizers.

Proof of Theorem 3.3. First of all, let

$$F(L, Z) = \frac{\|Y - Z\|_F^2}{2\lambda} + \frac{\|L\|_F^2}{2} + h(L, Z),$$

where

$$h(L, Z) = \begin{cases} \langle Z^T K^{\dagger} Z, L^{-1} \rangle_F / 2, & Z = KCL \\ +\infty, & \text{otherwise} \end{cases}$$

and observe that, by the properties of the trace and the pseudo-inverse, we have

$$Q(L, C) = F(L, KCL). \quad (4)$$

Now, we show that $F$ is jointly convex in $L$ and $Z$. Since the first two terms are quadratic and thus jointly convex, we only need to prove convexity of $h(L, Z)$. Let

$$H = L \otimes K, \qquad z = \mathrm{vec}(Z), \quad (5)$$

and observe that it suffices to prove convexity in the effective domain of $h$, where $H H^{\dagger} z = z$. Now, we also have

$$\langle Z^T K^{\dagger} Z, L^{\dagger} \rangle_F = z^T H^{\dagger} z,$$

and it turns out that this last function is jointly convex with respect to $z$ and $H$. Indeed, by the generalized Schur complement lemma (Albert, 1969), we have

$$z^T H^{\dagger} z \leq \alpha \iff \begin{pmatrix} H & z \\ z^T & \alpha \end{pmatrix} \in \mathcal{S}_+^{\ell m + 1},$$

so that the epigraph is convex. Since $(H, z)$ is a linear function of $(L, Z)$, the functional is also convex with respect to these variables. In conclusion, $h$ and hence $F$ are jointly convex.

Now, let $(L_*, C_*)$ denote a stationary point of $Q$ where $L$ is positive definite, and set $Z_* = K C_* L_*$. In view of (4), it follows that

$$0 = \frac{\partial Q}{\partial L}(L_*, C_*) = \frac{\partial F}{\partial L}(L_*, Z_*) + C_*^T K \frac{\partial F}{\partial Z}(L_*, Z_*),$$
$$0 = \frac{\partial Q}{\partial C}(L_*, C_*) = K \frac{\partial F}{\partial Z}(L_*, Z_*) L_*.$$

In particular, let

$$S := \frac{\partial F}{\partial Z}(L_*, Z_*) = Y - \lambda C_* - Z_*,$$

and let $U \in \mathbb{R}^{\ell \times m}$ denote the matrix whose columns are obtained by projecting the columns of $S$ over the nullspace of $K$. Then, letting

$$\widetilde{C} = C_* - U/\lambda, \qquad \widetilde{Z} = K \widetilde{C} L_*,$$

we have that $(L_*, \widetilde{C})$ is still stationary for $Q$, and in addition

$$Q(L_*, C_*) = Q(L_*, \widetilde{C}) = F(L_*, \widetilde{Z}).$$

Furthermore, the columns of $Y - \lambda \widetilde{C} - \widetilde{Z}$ belong to the range of $K$, therefore we also have

$$0 = \frac{\partial F}{\partial Z}(L_*, \widetilde{Z}) = Y - \lambda \widetilde{C} - \widetilde{Z}.$$

As a consequence, it also follows

$$0 = \frac{\partial F}{\partial L}(L_*, \widetilde{Z}).$$

In conclusion, the pair $(L_*, \widetilde{Z})$ is a stationary point for $F$ and, since $F$ is convex, also a global minimizer. Hence, we have

$$Q(L_*, C_*) = F(L_*, \widetilde{Z}) \leq F(L, Z),$$

for any $(L, Z)$, in particular for $Z = KCL$. In view of (4), it follows that $(L_*, C_*)$ is a global minimizer of $Q$. Finally, it follows by Theorem 3.2 that $Q$ is invex.

Algorithm 1 Block-wise coordinate descent
1: L, C, E, Z ← 0
2: while ‖Z + λC − Y‖_F ≥ δ do
3:   C ← solution to KCL + λC = Y
4:   E ← KC
5:   P ← (1/2) EᵀC − L
6:   Q ← solution to (EᵀE + λI) Q = P
7:   L ← L + λQ
8:   Z ← EL
9: end while

4. A block coordinate descent method

In the following, we derive a block-wise coordinate descent technique (Algorithm 1) for minimizing the functional of equation (3). The algorithm alternates between minimization with respect to $L$ and minimization with respect to $C$. The key computational steps are in lines 3 and 6, which are derived in Sections 4.1 and 4.2 respectively.

4.1. Sub-problem w.r.t. coefficient matrix C

From Equation (3) we see that, for any fixed $L$, $C$ is optimal if and only if

$$0 = \frac{\partial Q}{\partial C}(L, C) = -\frac{K\,(Y - \lambda C - KCL)\,L}{\lambda}.$$

Hence, a sufficient condition for $C$ to be optimal is

$$Y - \lambda C - KCL = 0, \quad (6)$$

a linear matrix equation called discrete-time Sylvester equation (Sima, 1996). In principle, this can be rewritten and solved as a large linear system of equations of the form

$$(L \otimes K + \lambda I)\,\mathrm{vec}(C) = \mathrm{vec}(Y).$$

However, this naive solution is not efficient for large scale problems. Sylvester equations arise in control theory, and there exist many efficient techniques that take advantage of the structure. We use an algorithm implemented in SLICOT¹.

4.2. Sub-problem w.r.t. output kernel L

For any fixed $C$, the problem with respect to $L$ is a strongly convex problem in the cone of symmetric positive semidefinite matrices. We derive an update that maintains positive semidefiniteness and can be computed by solving a linear system.

A key observation is the following: if $C$ has been obtained by solving the Sylvester equation (6), then the positive semidefinite constraint is never active in the sub-problem with respect to $L$. Namely, we have that

$$\arg\min_{L \in \mathcal{S}^m_+} Q(L, C) = \arg\min_{L \in \mathcal{M}^m} Q(L, C), \quad (7)$$

since the solution of the problem on the right hand side is automatically symmetric and positive semidefinite.

Lemma 4.1. Assume that $C$ is a solution of Equation (6) with $L = L_p$ and let $E := KC$. If $L$ solves the subproblem at the left hand side of (7), then we have

$$L = L_p + \lambda Q,$$

where $Q$ solves the linear system

$$\left(E^T E + \lambda I\right) Q = P, \qquad P := \frac{1}{2} E^T C - L_p. \quad (8)$$

Proof. Observe that matrix $L$ is optimal for the problem on the right hand side of (7) if and only if

$$0 = \frac{\partial Q}{\partial L}(L, C) = -\frac{C^T K\,(Y - \lambda C/2 - KCL)}{\lambda} + L,$$

that is

$$L = \left(C^T K^2 C + \lambda I\right)^{-1} C^T K \left(Y - \frac{\lambda}{2} C\right). \quad (9)$$

Now, recall that $C$ satisfies

$$Y - \frac{\lambda}{2} C = \frac{\lambda}{2} C + K C L_p,$$

where $L_p$ denotes the previous $L$, which is positive semidefinite. Hence, equation (9) reads

$$L = \left(C^T K^2 C + \lambda I\right)^{-1} \left(C^T K^2 C L_p + \frac{\lambda}{2} C^T K C\right), \quad (10)$$

showing that $L$ is symmetric and positive semidefinite. By letting $E := KC$, the update (10) can be rewritten in a more compact form:

$$L = \left(E^T E + \lambda I\right)^{-1} \left(E^T E L_p + \frac{\lambda}{2} E^T C\right) = L_p + \lambda \left(E^T E + \lambda I\right)^{-1} \left(\frac{1}{2} E^T C - L_p\right) = L_p + \lambda Q,$$

where (8) holds.

¹ www.slicot.org
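For concreteness, the following NumPy sketch mirrors Algorithm 1 under the assumption that $K$ is symmetric positive semidefinite. The Sylvester step is solved here by diagonalizing $K$ and $L$ rather than with the SLICOT routine used by the authors, and all names and tolerances are illustrative:

```python
import numpy as np

def block_coordinate_descent(K, Y, lam, delta=1e-6, max_iter=500):
    """Sketch of Algorithm 1 (illustrative implementation).

    K   : (l, l) input kernel matrix, assumed symmetric PSD
    Y   : (l, m) output matrix
    lam : regularization parameter lambda
    Returns the coefficient matrix C and the output kernel L.
    """
    l, m = Y.shape
    L, C, Z = np.zeros((m, m)), np.zeros((l, m)), np.zeros((l, m))
    s, U = np.linalg.eigh(K)                       # eigendecomposition of K, reused
    for _ in range(max_iter):
        if np.linalg.norm(Z + lam * C - Y, "fro") < delta:
            break
        # C-step (line 3): solve the Sylvester equation K C L + lam C = Y.
        # Diagonalizing K and L decouples it entrywise; the naive alternative
        # is the Kronecker system (L kron K + lam I) vec(C) = vec(Y).
        t, V = np.linalg.eigh(L)
        C = U @ ((U.T @ Y @ V) / (np.outer(s, t) + lam)) @ V.T
        # L-step (lines 4-7): closed-form update of Lemma 4.1.
        E = K @ C
        P = 0.5 * (E.T @ C) - L
        Qmat = np.linalg.solve(E.T @ E + lam * np.eye(m), P)
        L = L + lam * Qmat
        Z = E @ L                                  # line 8
    return C, L
```

Predictions at a new input $x$ then follow equation (2): $g(x) = L\, C^T k_x$, where $k_x$ collects the kernel evaluations $K(x, x_i)$ over the training inputs.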

[Figure 2: four panels (multiclass_sim0, multiclass_sim1, multiclass_sim2, multiclass_sim3). Each panel plots accuracy (%) and the improvement from learning L against the regularization parameter, for the learned output kernel ("learn L") and the identity output kernel ("identity").]
Figure 2. Improvement in classification accuracy from learning the output kernel. The four sub-figures correspond to the four different datasets as described in the text. The top plot in each sub-figure shows the accuracy of the classifier when using the identity kernel (green x) and when using the learned output kernel (blue +). The bottom plot in each sub-figure shows the difference between the accuracy of the learned output kernel classifier and the best standard classifier. That is, for each split, we find the peak accuracy of the classifier with identity output kernel, and use this as a baseline. The results demonstrate that when there is structure in the labels, our method can utilize this to improve accuracy.

5. Experiments

We present two types of empirical evidence showing the usefulness of our method in the multiclass setting². First, we show that learning the class similarities can result in improved accuracies (Section 5.1). Second, we visualize the learned structure between classes for some larger datasets (Section 5.2).

5.1. Improvement in classification accuracy

We simulate an easy structure discovery problem, namely when classes are mistakenly split. This verifies the intuition that, by grouping classes that are really the same, we increase the effective number of examples for each class and hence improve performance. In the experiments, we efficiently compute the solution for several values of the regularization parameter λ by using a warm-start procedure.

We generated data from a 100-dimensional mixture of Gaussians with unit variance. The means of the Gaussians are placed on the simplex centered around the origin, with edges of length √2. All datasets have 5 labels (denoted 0, 1, 2, 3, 4) but a varying number of classes. Up to 5 classes were considered (denoted a, b, c, d, e), with each class corresponding to a Gaussian.
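One way to realize this construction (an illustrative sketch, not the authors' code; all names are invented) is to center the standard basis vectors, which are pairwise √2 apart:

```python
import numpy as np

def simplex_means(n_classes=5, dim=100):
    """Vertices of a regular simplex with pairwise distance sqrt(2),
    centered at the origin and embedded in `dim` dimensions."""
    vertices = np.eye(n_classes)
    vertices -= vertices.mean(axis=0)      # centering preserves pairwise distances
    means = np.zeros((n_classes, dim))
    means[:, :n_classes] = vertices
    return means

def sample_class(mean, n, rng):
    """Draw n points from an isotropic unit-variance Gaussian at `mean`."""
    return mean + rng.standard_normal((n, mean.shape[0]))
```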

² people.tuebingen.mpg.de/fdinuzzo/okl.html

(a) USPS digits (b) Caltech 101 (c) Caltech 256

Figure 3. Interpreting the output kernel, by visualizing edges associated with the off-diagonal entries of L with largest values.

The first dataset sim0 consists of 5 independent classes, and the other 3 datasets were generated with the following label structure:

sim1: {0, 1} (class a), {2} (class b), {3} (class c), {4} (class d)
sim2: {0} (class a), {1} (class b), {2, 3, 4} (class c)
sim3: {0} (class a), {1, 2} (class b), {3, 4} (class c)

Each class contains 500 examples, which are then uniformly allocated labels at random when split. For example, sim1 is a mixture of 4 Gaussians, the first of which contains a 50/50 mix of labels 0 and 1. Note that in this case, even a perfect classifier would only obtain 87.5% accuracy since on a quarter of the examples, it would have 50% accuracy. In 20 different random splits, 5% of the examples were used for training with the linear kernel, and the rest for testing. The small training sets increase the difficulty of the problem, to explore the case when there is barely enough data to train the classifier.

From the results shown in Figure 2, we observe that in all cases with split classes, learning the kernel between labels significantly improves the performance. Since we have to estimate more parameters in our framework, we unsurprisingly perform worse in sim0, when the labels are truly independent.

5.2. Interpreting the learned similarity

The output kernel L provides the full information of distances between classes. We present a visualization of L by showing the graph whose edges are associated with the largest entries. The visualization was obtained by setting a threshold on the entries of L to retain only the largest inter-class similarities, and then placing the resulting graph in a visually appealing way with Nodebox³. In the following experiments, the learned matrices L are almost diagonal. Performances are comparable with state of the art methods. The experiments on the largest datasets took roughly a day to complete on a standard desktop. The limiting factor was the solution of the Sylvester equation.

³ http://nodebox.net/code/index.php/Graph
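A minimal sketch of the edge-extraction step (the graph layout itself was done with Nodebox; the helper below and its names are illustrative):

```python
import numpy as np

def top_similarity_edges(L, class_names, k=10):
    """Return the k off-diagonal entries of the output kernel L with the
    largest values, as (class_i, class_j, similarity) edges."""
    Lsym = (L + L.T) / 2.0                      # enforce symmetry
    iu, ju = np.triu_indices_from(Lsym, k=1)    # strictly upper triangle
    order = np.argsort(Lsym[iu, ju])[::-1][:k]  # indices of the k largest entries
    return [(class_names[iu[o]], class_names[ju[o]], Lsym[iu[o], ju[o]])
            for o in order]
```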
5.2.1. USPS digits

We use the standard training/test split of the USPS digits, and used a Gaussian kernel of width 40 on the raw pixel representation. The accuracy achieved (96.5%) is comparable to published results. A visualization of the matrix L containing the top 10 similarities is shown in Figure 3(a). Note that, since we use the Gaussian kernel on the pixels, the local information in the image is lost. It follows that the learned class similarities are "blind" to shape information. Classes that have correlated "digit" regions are deemed similar.

5.2.2. Caltech image classification

We used the Caltech image classification data (Fei-Fei et al., 2004; Griffin et al., 2007), with 30 examples per class. The similarity between images was computed using the average of the kernels used in (Gehler & Nowozin, 2009). This included the PHOG shape descriptor, the SIFT appearance descriptor, region covariance, local binary patterns and local Gabor functions. The results on Caltech 101 and Caltech 256 are shown in Figures 3(b) and 3(c). In the figures, singleton classes are not shown to avoid clutter. The achieved accuracies (75.3% and 43.9% respectively) are comparable to published results.

In Figure 3(b), we can observe 7 clusters of objects. Some of them are not surprising, such as the one containing Faces and Faces easy, or the "animal" cluster containing emu, gerenuk, llama and kangaroo. The latter is likely to be due to very similar backgrounds. An interesting cluster, which is fully connected, contains geometric objects with regular substructures, such as cellphone, minaret, accordion, pagoda, and trilobite. From Figure 3(c), we observe several meaningful clusters, such as planets (mars and saturn), space objects (comet and galaxy), sticklike objects (chopsticks, tweezer, sword), and objects of nature (cactus, fern, waterfall, toad, mushroom, grapes, hibiscus). We can conclude that the obtained visualizations are intuitively plausible. In addition, the learned similarity between classes depends both on the similarity between the inputs and the class labels.

6. Conclusions

In this paper, we proposed a method to learn simultaneously a vector-valued function and a kernel between the output's components. The learned kernel encodes structural relationships between the outputs and can be used for visualization purposes.

Our method is formulated as a regularization problem within the framework of RKHS of vector-valued functions. Although the objective functional is non-convex, we prove that all the stationary points are global minimizers, a property called invexity. By exploiting the specific structure of the problem, we derive an efficient block-wise coordinate descent algorithm that alternates between the solution of a discrete time Sylvester equation and a closed-form update for the output kernel matrix. The method is applied to learning label similarity in a multiclass setting. We can obtain an improvement in classification performance, and visualize the learned structure in the output space.

References

Albert, A. Conditions for positive and nonnegative definiteness in terms of pseudoinverses. SIAM Journal on Applied Mathematics, 17(2):434–440, 1969.
Bakir, G., Hofmann, T., Schölkopf, B., Smola, A., Taskar, B., and Vishwanathan, S. V. N. Predicting Structured Data. The MIT Press, 2007.

Baldassarre, L., Rosasco, L., Barla, A., and Verri, A. Vector field learning via spectral filtering. In Machine Learning and Knowledge Discovery in Databases, volume 6321, pp. 56–71. Springer, 2010.

Ben-Israel, A. and Mond, B. What is invexity? Journal of the Australian Mathematical Society, Series B, 28:1–9, 1986.

Caponnetto, A., Micchelli, C. A., Pontil, M., and Ying, Y. Universal multi-task kernels. Journal of Machine Learning Research, 9:1615–1646, 2008.

Carmeli, C., De Vito, E., and Toigo, A. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4:377–408, 2006.

Craven, B. D. and Glover, B. M. Invex functions and duality. Journal of the Australian Mathematical Society, 24:120, 1985.

Evgeniou, T., Micchelli, C. A., and Pontil, M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR, pp. 178, 2004.

Gehler, P. and Nowozin, S. On feature combination for multiclass object classification. In ICCV, pp. 221–228, 2009.

Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. Technical Report 7694, Caltech, 2007.

Lampert, C. H. and Blaschko, M. B. A multiple kernel learning approach to joint multi-class object detection. In DAGM, 2008.

Micchelli, C. A. and Pontil, M. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.

Mishra, S. K. and Giorgi, G. Invexity and Optimization. Nonconvex Optimization and Its Applications. Springer, Dordrecht, 2008.

Sima, V. Algorithms for Linear-Quadratic Optimization. Marcel Dekker, New York, 1996.

Zien, A. and Ong, C. S. Multiclass multiple kernel learning. In ICML, pp. 1191–1198, 2007.