Learning Output Kernels with Block Coordinate Descent
Francesco Dinuzzo [email protected]
Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany

Cheng Soon Ong [email protected]
Department of Computer Science, ETH Zürich, 8092 Zürich, Switzerland

Peter Gehler [email protected]
Max Planck Institute for Informatics, 66123 Saarbrücken, Germany

Gianluigi Pillonetto [email protected]
Department of Information Engineering, University of Padova, 35131 Padova, Italy

Abstract

We propose a method to learn simultaneously a vector-valued function and a kernel between its components. The obtained kernel can be used both to improve learning performance and to reveal structures in the output space which may be important in their own right. Our method is based on the solution of a suitable regularization problem over a reproducing kernel Hilbert space of vector-valued functions. Although the regularized risk functional is non-convex, we show that it is invex, implying that all local minimizers are global minimizers. We derive a block-wise coordinate descent method that efficiently exploits the structure of the objective functional. Then, we empirically demonstrate that the proposed method can improve classification accuracy. Finally, we provide a visual interpretation of the learned kernel matrix for some well known datasets.

1. Introduction

Classical learning tasks such as binary classification and regression can be solved by synthesizing a real-valued function from the data. More generally, problems such as multiple output regression, multitask learning, multiclass and multilabel classification can be addressed by first learning a vector-valued function, and then applying a suitable transformation to the outputs. In this paper, we introduce a method that simultaneously learns a vector-valued function and the kernel between the components of the output vector. The problem is formulated within the framework of regularization in RKH spaces. We assume that the matrix-valued kernel can be decomposed as the product of a scalar kernel and a positive semidefinite kernel matrix that represents the similarity between the output components, a structure that has been considered in a variety of works (Evgeniou et al., 2005; Caponnetto et al., 2008; Baldassarre et al., 2010).
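As a concrete illustration of this decomposition, the sketch below builds one $m \times m$ block of such a matrix-valued kernel from a scalar kernel and a positive semidefinite output kernel matrix. The Gaussian scalar kernel and all function names are illustrative assumptions, not part of the method developed in this paper.

```python
import numpy as np

def scalar_kernel(x1, x2, gamma=1.0):
    """Gaussian scalar kernel K(x1, x2); the Gaussian choice is only illustrative."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def output_kernel_block(x1, x2, L, gamma=1.0):
    """One m x m block of the decomposed matrix-valued kernel H(x1, x2) = K(x1, x2) * L."""
    return scalar_kernel(x1, x2, gamma) * L

rng = np.random.default_rng(0)
m = 3                                  # number of output components
A = rng.standard_normal((m, m))
L = A @ A.T                            # positive semidefinite output kernel matrix
x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
print(output_kernel_block(x1, x2, L))  # output similarities, scaled by K(x1, x2)
```

Over $n$ training inputs, the full $nm \times nm$ Gram matrix of such a decomposed kernel is the Kronecker product of the $n \times n$ scalar Gram matrix with $L$.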
In practice, an important issue is the choice of the kernel matrix between the outputs. In some cases, one may be able to fix it by using prior knowledge about the relationship between the components. However, such prior knowledge is not available in the vast majority of applications, therefore it is important to develop data-driven tools to choose the kernel automatically. The learned kernel matrix can also be used for visualization purposes, revealing interesting structures in the output space. For instance, when applying the method to a multiclass classification problem, the learned kernel matrix can be used to cluster the classes into homogeneous groups. Previous works, such as (Zien & Ong, 2007; Lampert & Blaschko, 2008), have shown that learning convex combinations of kernels on the output space can improve performance.

Our method searches over the whole cone of positive semidefinite matrices. Although the resulting optimization problem is non-convex, we prove that the objective is an invex function (Mishra & Giorgi, 2008), namely it has the property that stationary points are global minimizers. Then, we propose an efficient block-wise coordinate descent algorithm to compute an optimal solution. The algorithm alternates between the solution of a so-called discrete time Sylvester equation and a quadratic optimization problem over the cone of positive semidefinite matrices. Exploiting the specific structure of the problem, we propose an efficient solution for both steps.

The four contributions of this paper are as follows. First, we introduce a new problem of learning simultaneously a vector-valued function and the similarity between its components. Second, we show that our non-convex objective is in fact invex, and hence local minimizers are global minimizers. Third, we derive an efficient block-wise coordinate descent algorithm to solve our problem. Finally, we show empirical evidence that our method can improve performance, and also discover meaningful structures in the output space.

2. Spaces of vector-valued functions

We review some facts used to model learning tasks such as multiple output regression, multiclass and multilabel classification. The key idea is to learn functions with values in a vector space.

2.1. Reproducing Kernel Hilbert Spaces

We provide some background here on reproducing kernel Hilbert spaces (RKHSs) of vector-valued functions, i.e. where the outputs are vectors instead of scalars. Let $\mathcal{Y}$ denote a Hilbert space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{Y}}$, and $\mathcal{L}(\mathcal{Y})$ the space of bounded linear operators from $\mathcal{Y}$ into itself. The following definitions introduce the positive semidefinite $\mathcal{Y}$-kernel and the RKHS of $\mathcal{Y}$-valued functions, which extend the classical concepts of positive semidefinite kernel and RKHS. See Micchelli & Pontil (2005); Carmeli et al. (2006) for further details.

Definition 2.1 (Positive semidefinite $\mathcal{Y}$-kernel). Let $\mathcal{X}$ denote a non-empty set and $\mathcal{Y}$ a Hilbert space. A symmetric function $H : \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$ is called a positive semidefinite $\mathcal{Y}$-kernel on $\mathcal{X}$ if, for any finite integer $\ell$, the following holds:
$$\sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \langle y_i, H(x_i, x_j) y_j \rangle_{\mathcal{Y}} \geq 0, \qquad \forall (x_i, y_i) \in (\mathcal{X}, \mathcal{Y}).$$

We now introduce Hilbert spaces of vector-valued functions that can be associated to positive semidefinite $\mathcal{Y}$-kernels.

Definition 2.2 (RKHS of $\mathcal{Y}$-valued functions). A Reproducing Kernel Hilbert Space (RKHS) of $\mathcal{Y}$-valued functions $g : \mathcal{X} \to \mathcal{Y}$ is a Hilbert space $\mathcal{H}$ such that, for all $x \in \mathcal{X}$, there exists $C_x \in \mathbb{R}$ such that
$$\|g(x)\|_{\mathcal{Y}} \leq C_x \|g\|_{\mathcal{H}}, \qquad \forall g \in \mathcal{H}.$$

It turns out that every RKHS of $\mathcal{Y}$-valued functions $\mathcal{H}$ can be associated with a unique positive semidefinite $\mathcal{Y}$-kernel $H$, called the reproducing kernel. Conversely, given a positive semidefinite $\mathcal{Y}$-kernel $H$ on $\mathcal{X}$, there exists a unique RKHS of $\mathcal{Y}$-valued functions defined over $\mathcal{X}$ whose reproducing kernel is $H$. The standard definitions of positive semidefinite (scalar) kernel and RKHS (of real-valued functions) can be recovered by letting $\mathcal{Y} = \mathbb{R}$.

For the rest of the paper, we assume $\mathcal{Y} = \mathbb{R}^m$. In such a case, $\mathcal{L}(\mathcal{Y})$ is just the space of square matrices of order $m$. By fixing a basis $\{b_i\}_{i \in \mathcal{T}}$ for the output space, where $\mathcal{T} = \{1, \ldots, m\}$, one can uniquely define an associated (scalar-valued) kernel $R$ over $\mathcal{X} \times \mathcal{T}$ such that
$$\langle b_i, H(x_1, x_2) b_j \rangle_{\mathcal{Y}} = R((x_1, i), (x_2, j)),$$
so that a $\mathcal{Y}$-kernel can be seen as a function that maps two inputs into the space of square matrices of order $m$. Similarly, by fixing any function $g : \mathcal{X} \to \mathcal{Y}$, one can uniquely define an associated function $h : \mathcal{X} \times \mathcal{T} \to \mathbb{R}$ such that
$$g(x) = \sum_{i \in \mathcal{T}} h(x, i) \, b_i.$$
Consequently, a space of $\mathcal{Y}$-valued functions over $\mathcal{X}$ is isomorphic to a standard space of scalar-valued functions defined over the input set $\mathcal{X} \times \mathcal{T}$. In conclusion, by fixing a basis for the output space, the problem of learning a vector-valued function can be equivalently regarded as a problem of learning a scalar function defined over an enlarged input set, a modeling approach that is also used in the structured output learning setting (Bakir et al., 2007).
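As a small numerical sketch of this correspondence (assuming the standard basis of $\mathbb{R}^m$ and, purely for illustration, the same Gaussian-times-$L$ decomposed kernel as in the earlier snippet), each entry of a matrix-valued kernel evaluation coincides with the associated scalar kernel evaluated on the enlarged input set $\mathcal{X} \times \mathcal{T}$:

```python
import numpy as np

def H(x1, x2, L, gamma=1.0):
    """Matrix-valued kernel evaluation, again using the decomposed form K(x1, x2) * L."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2)) * L

def R(x1, i, x2, j, L, gamma=1.0):
    """Associated scalar kernel on the enlarged input set X x T:
    R((x1, i), (x2, j)) = <b_i, H(x1, x2) b_j> with the standard basis b_i = e_i."""
    return H(x1, x2, L, gamma)[i, j]

rng = np.random.default_rng(0)
m = 3
A = rng.standard_normal((m, m))
L = A @ A.T                              # PSD output kernel matrix
x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
e = np.eye(m)                            # standard basis of R^m

for i in range(m):
    for j in range(m):
        # <b_i, H(x1, x2) b_j> agrees with the scalar kernel R((x1, i), (x2, j))
        assert np.isclose(e[i] @ H(x1, x2, L) @ e[j], R(x1, i, x2, j, L))
```

In this decomposed case the associated scalar kernel factorizes as $R((x_1, i), (x_2, j)) = K(x_1, x_2) \, L_{ij}$.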
2.2. Learning Tasks

As mentioned in the introduction, many learning tasks can be solved by first learning a function taking values in $\mathbb{R}^m$. We list here brief descriptions of some commonly used models. First of all, in multiple output regression, each component of the vector directly corresponds to an output. For multiclass classification, output data are modeled as binary vectors, with $+1$ in the position corresponding to the class. By interpreting the components of the learned vector-valued function $g$ as "confidence scores" for the different classes, one can consider a classification rule of the form $\hat{y}(x) = \arg\max_{i \in \mathcal{T}} g_i(x)$. For binary multilabel classification, outputs are vectors with $\pm 1$ at the different components. The final classification rule is given by the component-wise application of a sign function: $\hat{y}_i(x) = \mathrm{sign}(g_i(x))$.

3. Learning an output kernel

In this section, we introduce and study an optimization problem that can be used to learn simultaneously a vector-valued function and a kernel on the outputs.

Figure 1. The objective function for scalar K and L. Although the level sets are non-convex, stationary points are global minimizers.

3.1. Matrix notation

Let $\mathcal{S}^m_{++} \subseteq \mathcal{S}^m_{+} \subseteq \mathcal{M}^m$ denote the open cone of positive definite matrices, the closed cone of positive semidefinite matrices, and the space of square matrices of order $m$, respectively. For any $A, B \in \mathcal{M}^m$, denote the Frobenius inner product $\langle A, B \rangle_F := \mathrm{tr}(A^T B)$, and the induced norm $\|A\|_F := \sqrt{\langle A, A \rangle_F}$. We denote the identity matrix as $I$ and, for any matrix $A$, let $A^T$ denote its transpose, and $A^{\dagger}$ its Moore-Penrose pseudo-inverse. The Kronecker product is denoted by $\otimes$, and the vectorization operator by $\mathrm{vec}(\cdot)$. In particular, for matrices $A, B, C$ of suitable size, we use the identity
$$\mathrm{tr}(A^T B A C) = \mathrm{vec}(A)^T (C^T \otimes B) \, \mathrm{vec}(A).$$
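As a quick sanity check of this identity (not from the paper; just a numpy illustration), it can be verified numerically with the column-stacking convention for $\mathrm{vec}(\cdot)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
A = rng.standard_normal((n, m))
B = rng.standard_normal((n, n))
C = rng.standard_normal((m, m))

vecA = A.flatten(order="F")          # vec(.) stacks the columns of A
lhs = np.trace(A.T @ B @ A @ C)
rhs = vecA @ np.kron(C.T, B) @ vecA  # vec(A)^T (C^T kron B) vec(A)
assert np.isclose(lhs, rhs)
```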