Vector-valued Manifold Regularization

Hà Quang Minh [email protected]
Italian Institute of Technology, Via Morego 30, Genoa 16163, Italy

Vikas Sindhwani [email protected]
Mathematical Sciences, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 USA

Abstract

We consider the general problem of learning an unknown functional dependency, f : X → Y, between a structured input space X and a structured output space Y, from labeled and unlabeled examples. We formulate this problem in terms of data-dependent regularization in Vector-valued Reproducing Kernel Hilbert Spaces (Micchelli & Pontil, 2005), which elegantly extend familiar scalar-valued kernel methods to the general setting where Y has a Hilbert space structure. Our methods provide a natural extension of Manifold Regularization (Belkin et al., 2006) algorithms to also exploit output inter-dependencies while enforcing smoothness with respect to input data geometry. We propose a class of matrix-valued kernels which allow efficient implementations of our algorithms via the use of numerical solvers for Sylvester matrix equations. On multi-label image annotation and text classification problems, we find favorable empirical comparisons against several competing alternatives.

1. Introduction

The statistical and algorithmic study of regression and binary classification problems has formed the bedrock of modern machine learning. Motivated by new applications, data characteristics, and scalability requirements, several generalizations and extensions of these canonical settings have been vigorously pursued in recent years. We point out two particularly dominant threads of research: (1) semi-supervised learning, i.e., learning from unlabeled examples by exploiting the geometric structure of the marginal probability distribution over the input space, and (2) structured multi-output prediction, i.e., learning to simultaneously predict a collection of output variables by exploiting their inter-dependencies. We point the reader to Chapelle et al. (2006) and Bakir et al. (2007) for several representative papers on semi-supervised learning and structured prediction respectively. In this paper, we consider a problem at the intersection of these threads: non-parametric estimation of a vector-valued function, f : X → Y, from labeled and unlabeled examples.

Our starting point is multivariate regression in a regularized least squares (RLS) framework (see, e.g., Brown & Zidek (1980)), which is arguably the classical precursor of much of the modern literature on structured prediction, multi-task learning, multi-label classification and related themes that attempt to exploit output structure. We adopt the formalism of Vector-valued Reproducing Kernel Hilbert Spaces (Micchelli & Pontil, 2005) to pose function estimation problems naturally in an RKHS of Y-valued functions, where Y in general can be an infinite-dimensional (Hilbert) space. We derive an abstract system of functional linear equations that gives the solution to a generalized Manifold Regularization (Belkin et al., 2006) framework for vector-valued semi-supervised learning. For multivariate problems with n output variables, the kernel K(·, ·) associated with a vector-valued RKHS is matrix-valued, i.e., for any x, z ∈ X, K(x, z) ∈ ℝ^{n×n}. We show that a natural choice for a matrix-valued kernel leads to a Sylvester Equation, whose solution can be obtained relatively efficiently using techniques in numerical linear algebra. This leads to a vector-valued Laplacian Regularized Least Squares (Laplacian RLS) model that learns not only from the geometry of unlabeled data (Belkin et al., 2006) but also from dependencies among output variables estimated using an output graph Laplacian.

We find encouraging empirical results with this approach on semi-supervised multi-label classification problems, in comparison to several recently proposed alternatives. We begin this paper with relevant background material on Manifold Regularization and multivariate RLS. Throughout the paper, we draw attention to mathematical correspondences between scalar and vector-valued settings.

2. Background

Let us recall the familiar regression and classification setting where Y = ℝ. Let k : X × X → ℝ be a standard kernel with an associated RKHS family of functions H_k. Given a collection of labeled examples, {x_i, y_i}_{i=1}^{l}, kernel-based prediction methods set up a problem,

    f* = argmin_{f ∈ H_k} (1/l) Σ_{i=1}^{l} V(y_i, f(x_i)) + γ ||f||_k^2,    (1)

where the choice V(t, y) = (t − y)^2 leads to Regularized Least Squares (RLS) while V(t, y) = max(0, 1 − yt) leads to the SVM algorithm. By the classical Representer theorem (Schölkopf & Smola, 2002), this family of algorithms reduces to estimation of finite-dimensional coefficients, a = [α_1, ..., α_l]^T, for a minimizer that can be shown to have the form f*(x) = Σ_{i=1}^{l} α_i k(x, x_i). In particular, RLS reduces to solving the linear system [G_k^l + γ l I_l] a = y, where y = [y_1 ... y_l]^T, I_l is the l × l identity matrix and G_k^l denotes the Gram matrix of the kernel over the labeled data, i.e., (G_k^l)_{ij} = k(x_i, x_j). Let us now review two extensions of this algorithm: first for semi-supervised learning, and then for multivariate problems where Y = ℝ^n.

Semi-supervised learning typically proceeds by making assumptions such as smoothness of the prediction function with respect to an underlying low-dimensional data manifold, or the presence of clusters as detected using a relatively large set of u unlabeled examples, {x_i}_{i=l+1}^{l+u}. We will use the notation N = l + u. In Manifold Regularization (Belkin et al., 2006), a nearest neighbor graph, W, is constructed, which serves as a discrete probe for the geometric structure of the data. The Laplacian L of this graph provides a natural intrinsic measure of data-dependent smoothness:

    f^T L f = (1/2) Σ_{i,j=1}^{N} W_ij (f(x_i) − f(x_j))^2,

where f = [f(x_1) ... f(x_N)]. Thus, it is natural to extend (1) as follows,

    f* = argmin_{f ∈ H_k} (1/l) Σ_{i=1}^{l} (y_i − f(x_i))^2 + γ_A ||f||_k^2 + γ_I f^T L f,    (2)

where γ_A, γ_I are referred to as ambient and intrinsic regularization parameters. By using reproducing properties of H_k, the Representer theorem can correspondingly be extended to show that the minimizer has the form f*(x) = Σ_{i=1}^{N} α_i k(x, x_i), involving both labeled and unlabeled data. The Laplacian RLS algorithm estimates a = [α_1 ... α_N]^T by solving the linear system [J_l^N G_k^N + lγ_I L G_k^N + lγ_A I_N] a = y, where G_k^N is the Gram matrix of k with respect to both labeled and unlabeled examples, I_N is the N × N identity matrix, J_l^N is an N × N diagonal matrix with the first l diagonal entries equaling 1 and the rest being 0, and y is the N × 1 label vector with y_i = 0 for i > l. Laplacian RLS and Laplacian SVM tend to give similar empirical performance (Sindhwani et al., 2005).

Consider now two natural approaches to extending Laplacian RLS to the multivariate case Y = ℝ^n. Let f = (f_1, ..., f_n) be the components of a vector-valued function where each f_j ∈ H_k. Let the j-th output label of x_i be denoted as y_ij. Then, one formulation for multivariate Laplacian RLS is to solve,

    f* = argmin_{f_j ∈ H_k, 1 ≤ j ≤ n} (1/l) Σ_{i=1}^{l} Σ_{j=1}^{n} (y_ij − f_j(x_i))^2 + γ_A Σ_{j=1}^{n} ||f_j||_k^2 + γ_I trace[F^T L F],    (3)

where F_ij = f_j(x_i), 1 ≤ i ≤ N, 1 ≤ j ≤ n. Let α be an N × n matrix of expansion coefficients, i.e., the minimizers have the form f_j(x) = Σ_{i=1}^{N} α_ij k(x_i, x). It is easily seen that the solution is given by,

    [J_l^N G_k^N + lγ_I L G_k^N + lγ_A I_N] α = Y,    (4)

where Y is the label matrix with Y_ij = 0 for i > l and all j. It is clear that this multivariate solution is equivalent to learning each output independently, ignoring prior knowledge such as the availability of a similarity graph W_out over output variables. Such prior knowledge can naturally be incorporated by adding a smoothing term to (3) which, for example, enforces f_i to be close to f_j in the RKHS norm ||·||_k if output i is similar to output j, i.e., (W_out)_ij is sufficiently large. We defer this development to later in the paper, as both these solutions are special cases of a broader vector-valued RKHS framework for Laplacian RLS in which they correspond to certain choices of a matrix-valued kernel. We first give a self-contained review of the language of vector-valued RKHS in the following section.
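As a concrete illustration, the multivariate Laplacian RLS system (4) can be assembled and solved directly with a dense linear solver. The following minimal NumPy sketch does this; the function name and input conventions are ours, not from the paper.

```python
import numpy as np

def multivariate_laplacian_rls(G, L, Y, l, gamma_A, gamma_I):
    """Solve the multivariate Laplacian RLS system (4):
    (J_l G + l*gamma_I L G + l*gamma_A I) alpha = Y,
    where G is the N x N scalar Gram matrix over labeled + unlabeled points,
    L is the N x N graph Laplacian, Y is the N x n label matrix with zero
    rows for unlabeled points, and l is the number of labeled points."""
    N = G.shape[0]
    J = np.diag([1.0] * l + [0.0] * (N - l))          # selects labeled rows
    A = J @ G + l * gamma_I * (L @ G) + l * gamma_A * np.eye(N)
    return np.linalg.solve(A, Y)                      # N x n expansion coefficients
```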

3. Vector-Valued RKHS

The study of RKHS has been extended to vector-valued functions and further developed and applied in machine learning (see Carmeli et al. (2006); Micchelli & Pontil (2005); Caponnetto et al. (2008) and references therein). In the following, denote by X a nonempty set, Y a real Hilbert space with inner product (·, ·)_Y, and L(Y) the Banach space of bounded linear operators on Y.

Let Y^X denote the vector space of all functions f : X → Y. A function K : X × X → L(Y) is said to be an operator-valued positive definite kernel if for each pair (x, z) ∈ X × X, K(x, z) ∈ L(Y) is a self-adjoint operator and

    Σ_{i,j=1}^{N} (y_i, K(x_i, x_j) y_j)_Y ≥ 0    (5)

for every finite set of points {x_i}_{i=1}^{N} in X and {y_i}_{i=1}^{N} in Y. As in the scalar case, given such a K, there exists a unique Y-valued RKHS H_K with reproducing kernel K. The construction of the space H_K proceeds as follows. For each x ∈ X and y ∈ Y, we form a function K_x y = K(·, x) y ∈ Y^X defined by

    (K_x y)(z) = K(z, x) y   for all z ∈ X.

Consider the set H_0 = span{K_x y | x ∈ X, y ∈ Y} ⊂ Y^X. For f = Σ_{i=1}^{N} K_{x_i} w_i and g = Σ_{i=1}^{N} K_{z_i} y_i in H_0, we define

    (f, g)_{H_K} = Σ_{i,j=1}^{N} (w_i, K(x_i, z_j) y_j)_Y.

Taking the closure of H_0 gives us the Hilbert space H_K. The reproducing property is

    (f(x), y)_Y = (f, K_x y)_{H_K}   for all f ∈ H_K.    (6)

As in the scalar case, applying the Cauchy-Schwarz inequality gives

    |(f(x), y)_Y| ≤ √(||K(x, x)||) ||f||_{H_K} ||y||_Y.

Thus for each x ∈ X and each y ∈ Y, the evaluation operator E_{x|y} : f → (f(x), y)_Y is bounded as a linear operator from H_K to ℝ. As in the scalar case, the converse is true by the Riesz Representation Theorem.

Let K_x : Y → H_K be the linear operator with K_x w defined as above; then

    ||K_x y||^2_{H_K} = (K(x, x) y, y)_Y ≤ ||K(x, x)|| ||y||^2_Y,

which implies that

    ||K_x : Y → H_K|| ≤ √(||K(x, x)||),

so that K_x is a bounded operator for each x ∈ X. Let K_x^* : H_K → Y be the adjoint operator of K_x; then from (6), we have

    f(x) = K_x^* f   for all x ∈ X, f ∈ H_K.    (7)

From this we deduce that for all x ∈ X and all f ∈ H_K,

    ||f(x)||_Y ≤ ||K_x^*|| ||f||_{H_K} ≤ √(||K(x, x)||) ||f||_{H_K},

that is, for each x ∈ X, the evaluation operator E_x : H_K → Y defined by E_x f = K_x^* f is a bounded linear operator. In particular, if κ = sup_{x ∈ X} √(||K(x, x)||) < ∞, then ||f||_∞ = sup_{x ∈ X} ||f(x)||_Y ≤ κ ||f||_{H_K} for all f ∈ H_K. In this paper, we will be concerned with kernels for which κ < ∞.

3.1. Vector-valued Regularized Least Squares

Let Y be a separable Hilbert space. Let z = {(x_i, y_i)}_{i=1}^{m} be the given labeled data. Consider now the vector-valued version of the regularized least squares algorithm in H_K, with γ > 0:

    f_{z,γ} = argmin_{f ∈ H_K} (1/m) Σ_{i=1}^{m} ||f(x_i) − y_i||^2_Y + γ ||f||^2_{H_K}.    (8)

From the Inverse Problems literature, recall the familiar Tikhonov Regularization form in Hilbert spaces: argmin_{x ∈ H_1} ||Ax − b||^2_{H_2} + γ ||x||^2_{H_1}, where H_1, H_2 are Hilbert spaces and A : H_1 → H_2 is a bounded linear operator. To cast (8) into this familiar form, we can consider an approach as in Caponnetto & Vito (2007). First, we set up the sampling operator S_x : H_K → Y^m defined by S_x(f) = (f(x_1), ..., f(x_m)). By definition, we have for any f ∈ H_K and y = (y_1, ..., y_m) ∈ Y^m,

    (S_x f, y)_{Y^m} = Σ_{i=1}^{m} (S_{x_i} f, y_i)_Y = Σ_{i=1}^{m} (f, K_{x_i} y_i)_{H_K} = (f, Σ_{i=1}^{m} K_{x_i} y_i)_{H_K}.

It follows that the adjoint operator S_x^* : Y^m → H_K is given by S_x^* y = S_x^*(y_1, ..., y_m) = Σ_{i=1}^{m} K_{x_i} y_i, and the operator S_x^* S_x : H_K → H_K is given by S_x^* S_x f = Σ_{i=1}^{m} K_{x_i} f(x_i).

We can now cast expression (8) into the familiar Tikhonov form,

    f_{z,γ} = argmin_{f ∈ H_K} (1/m) ||S_x f − y||^2_{Y^m} + γ ||f||^2_{H_K}.

This problem has a unique solution, given by

    f_{z,γ} = (S_x^* S_x + mγ I)^{−1} S_x^* y.    (9)

This function has the explicit form f_{z,γ} = Σ_{i=1}^{m} K_{x_i} a_i, with f_{z,γ}(x) = Σ_{i=1}^{m} K(x, x_i) a_i, where the vectors a_i ∈ Y satisfy the m linear equations

    Σ_{j=1}^{m} K(x_i, x_j) a_j + mγ a_i = y_i,    (10)

for 1 ≤ i ≤ m. This result was first reported in Micchelli & Pontil (2005). It is a special case of our Proposition 1 below.
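When Y = ℝ^n is finite dimensional, (10) is an ordinary linear system in the stacked coefficients. The sketch below (the block-stacking convention and function name are ours, added for illustration) assembles the mn × mn block Gram matrix and solves it.

```python
import numpy as np

def vector_valued_rls(K_blocks, y, gamma):
    """Solve system (10) when Y = R^n: stack the m x m grid of n x n kernel
    blocks K(x_i, x_j) into an (mn) x (mn) Gram matrix G and solve
    (G + m*gamma*I) a = y for the stacked coefficients.
    K_blocks: array of shape (m, m, n, n); y: array of shape (m, n)."""
    m, _, n, _ = K_blocks.shape
    G = K_blocks.transpose(0, 2, 1, 3).reshape(m * n, m * n)   # block Gram matrix
    a = np.linalg.solve(G + m * gamma * np.eye(m * n), y.reshape(m * n))
    return a.reshape(m, n)   # row i is the coefficient vector a_i in R^n
```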

In the scalar case Y = ℝ, it specializes to the well-known Regularized Least Squares solution reviewed in Section 2. Note that in the finite-dimensional case Y = ℝ^n, the kernel function is matrix-valued; unless it is diagonal, outputs are not treated independently.

4. Vector-valued Manifold Regularization

Let z = {(x_i, y_i)}_{i=1}^{l} ∪ {x_i}_{i=l+1}^{u+l} be the given set of labeled and unlabeled data. Let M : Y^{u+l} → Y^{u+l}, M ∈ L(Y^{u+l}), be a symmetric, positive operator, that is, (y, My)_{Y^{u+l}} ≥ 0 for all y ∈ Y^{u+l}. Here Y^{u+l} is the usual (u+l)-fold direct product of Y, with inner product

    ((y_1, ..., y_{u+l}), (w_1, ..., w_{u+l}))_{Y^{u+l}} = Σ_{i=1}^{u+l} (y_i, w_i)_Y.

For f ∈ H_K, let f = (f(x_1), ..., f(x_{u+l})) ∈ Y^{u+l}. Consider the following optimization problem

    f_{z,γ} = argmin_{f ∈ H_K} (1/l) Σ_{i=1}^{l} V(f(x_i), y_i) + γ_A ||f||^2_K + γ_I (f, Mf)_{Y^{u+l}},    (11)

where V is some convex error function, and γ_A, γ_I > 0 are regularization parameters.

4.1. Representer Theorem

Theorem 1. The minimization problem (11) has a unique solution, given by f_{z,γ} = Σ_{i=1}^{u+l} K_{x_i} a_i for some vectors a_i ∈ Y, 1 ≤ i ≤ u + l.

Proof. Denote the right-hand side of (11) by I_l(f). Then I_l(f) is coercive and strictly convex in f, and thus has a unique minimizer. Let H_{K,x} = {Σ_{i=1}^{u+l} K_{x_i} y_i : y ∈ Y^{u+l}}. For f ∈ H_{K,x}^⊥, by the reproducing property, the sampling operator S_x satisfies

    (S_x f, y)_{Y^{u+l}} = (f, Σ_{i=1}^{u+l} K_{x_i} y_i)_{H_K} = 0.

This holds true for all y ∈ Y^{u+l}, thus

    S_x f = (f(x_1), ..., f(x_{u+l})) = 0.

For an arbitrary f ∈ H_K, consider the orthogonal decomposition f = f_0 + f_1, with f_0 ∈ H_{K,x}, f_1 ∈ H_{K,x}^⊥. Then, because ||f_0 + f_1||^2_{H_K} = ||f_0||^2_{H_K} + ||f_1||^2_{H_K}, the result just obtained shows that

    I_l(f) = I_l(f_0 + f_1) ≥ I_l(f_0),

with equality if and only if ||f_1||_{H_K} = 0, that is f_1 = 0. Thus the minimizer of (11) must lie in H_{K,x}.

4.2. Vector-valued Laplacian RLS

For manifold regularization of vector-valued functions using the least squares error, the minimization problem is

    f_{z,γ} = argmin_{f ∈ H_K} (1/l) Σ_{i=1}^{l} ||f(x_i) − y_i||^2_Y + γ_A ||f||^2_K + γ_I (f, Mf)_{Y^{u+l}},    (12)

for some γ_A > 0, γ_I > 0. The operator M : Y^{u+l} → Y^{u+l} can be expressed as an operator-valued matrix M = (M_ij)_{i,j=1}^{u+l} of size (u+l) × (u+l), with each M_ij : Y → Y being a linear operator, so that (Mf)_i = Σ_{j=1}^{u+l} M_ij f_j = Σ_{j=1}^{u+l} M_ij f(x_j).

Proposition 1. The minimization problem (12) has a unique solution f_{z,γ} = Σ_{i=1}^{u+l} K_{x_i} a_i, where the vectors a_i ∈ Y satisfy the u + l linear equations:

    Σ_{j=1}^{u+l} K(x_i, x_j) a_j + lγ_I Σ_{j,k=1}^{u+l} M_ik K(x_k, x_j) a_j + lγ_A a_i = y_i,    (13)

for 1 ≤ i ≤ l, and

    γ_I Σ_{j,k=1}^{u+l} M_ik K(x_k, x_j) a_j + γ_A a_i = 0,    (14)

for l + 1 ≤ i ≤ u + l.

Remark 1. We emphasize that in general, the vectors a_i can be infinite-dimensional, being elements of Y. Thus (13) and (14) are in general linear systems of functional equations. The finite-dimensional case, where we have a system of linear equations, is discussed below.

Proof. We know that (12) has a unique solution by Theorem 1. Using the sampling operator S_x, we write

    f_{z,γ} = argmin_{f ∈ H_K} (1/l) ||S_{x,l} f − y||^2_{Y^l} + γ_A ||f||^2_K + γ_I (S_{x,u+l} f, M S_{x,u+l} f)_{Y^{u+l}}.

Differentiating gives

    f_{z,γ} = (S^*_{x,l} S_{x,l} + lγ_A I + lγ_I S^*_{x,u+l} M S_{x,u+l})^{−1} S^*_{x,l} y.

This is equivalent to

    (S^*_{x,l} S_{x,l} + lγ_A I + lγ_I S^*_{x,u+l} M S_{x,u+l}) f_{z,γ} = S^*_{x,l} y.

By definition of the sampling operators, this is

    Σ_{i=1}^{l} K_{x_i} f_{z,γ}(x_i) + lγ_A f_{z,γ} + lγ_I Σ_{i=1}^{u+l} K_{x_i} (M f_{z,γ})_i = Σ_{i=1}^{l} K_{x_i} y_i,

which we rewrite as

    f_{z,γ} = −(γ_I / γ_A) Σ_{i=1}^{u+l} K_{x_i} (M f_{z,γ})_i + Σ_{i=1}^{l} K_{x_i} (y_i − f_{z,γ}(x_i)) / (lγ_A).

This shows that there are vectors a_i in Y such that f_{z,γ} = Σ_{i=1}^{u+l} K_{x_i} a_i. We have f_{z,γ}(x_i) = Σ_{j=1}^{u+l} K(x_i, x_j) a_j, and

    (M f_{z,γ})_i = Σ_{k=1}^{u+l} M_ik Σ_{j=1}^{u+l} K(x_k, x_j) a_j.

Thus we have, for 1 ≤ i ≤ l:

    a_i = −(γ_I / γ_A) Σ_{k=1}^{u+l} M_ik Σ_{j=1}^{u+l} K(x_k, x_j) a_j + (y_i − Σ_{j=1}^{u+l} K(x_i, x_j) a_j) / (lγ_A),

which gives formula (13). Similarly, a_i = −(γ_I / γ_A) Σ_{j,k=1}^{u+l} M_ik K(x_k, x_j) a_j for l + 1 ≤ i ≤ u + l, implying (14). This completes the proof.

Example 1. For Y = ℝ, and choosing M to be the graph Laplacian, it can be easily verified that (13), (14) specialize to (2). This gives the scalar case for manifold regularization as obtained in Belkin et al. (2006).

Example 2. For Y = ℝ^n (or any n-dimensional inner product space), the kernel K(x, z) is an n × n matrix. Let G be the Nn × Nn block matrix whose (i, j) block is the n × n matrix K(x_i, x_j), where N = l + u. Let a = (a_1, ..., a_{u+l}) and y = (y_1, ..., y_{u+l}) be column vectors in ℝ^{n(u+l)}, with each a_i, y_i ∈ ℝ^n, and y_{l+1} = ... = y_{u+l} = 0. Then the system of linear equations (13), (14) becomes

    [J_nl^{Nn} G + lγ_I M G + lγ_A I_{Nn}] a = y,    (15)

where M now is a positive semidefinite matrix of size n(u+l) × n(u+l) and J_nl^{Nn} = diag(1, ..., 1, 0, ..., 0) is the Nn × Nn diagonal matrix whose first nl diagonal entries are 1 and the rest 0.

4.3. Vector-valued graph Laplacian

Let G = (V, E) be an undirected graph of size |V| = N, with symmetric, nonnegative weight matrix W. For an ℝ^n-valued function f = (f_1, ..., f_n), we define ∆f = (Lf_1, ..., Lf_n), where L is the scalar graph Laplacian. Thus one can set ∆ to be a block diagonal matrix of size nN × nN, with each block (i, i) being the scalar graph Laplacian of size N × N. For f = (f_k)_{k=1}^{n}, where f_k = (f_k(x_1), ..., f_k(x_N)), we have

    (f, ∆f)_{ℝ^{nN}} = Σ_{k=1}^{n} (f_k, L f_k)_{ℝ^N} = (1/2) Σ_{i,j=1}^{N} W_ij ||f(x_i) − f(x_j)||^2_{ℝ^n}.

It is clear that here we have ∆ = I_n ⊗ L, where ⊗ denotes the Kronecker (tensor) matrix product (compare with the examples of the last section). It is straightforward to generalize to Y-valued function spaces, where Y is any separable Hilbert space, by defining (∆f)_i = Σ_{j=1}^{N} W_ij (f(x_i) − f(x_j)), and

    (f, ∆f)_{Y^N} = (1/2) Σ_{i,j=1}^{N} W_ij ||f(x_i) − f(x_j)||^2_Y.
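As a quick numerical sanity check (not from the paper; the helper name is ours), the identity (f, ∆f) = ½ Σ_ij W_ij ||f(x_i) − f(x_j)||^2 = trace(F^T L F), with ∆ = I_n ⊗ L and F the N × n matrix of function values, can be verified as follows.

```python
import numpy as np

def laplacian_penalty(F, W):
    """Evaluate (f, Delta f) = 0.5 * sum_{i,j} W_ij ||f(x_i) - f(x_j)||^2 for
    an N x n matrix F of function values (row i is f(x_i)) and a symmetric
    N x N weight matrix W.  With L = D - W and Delta = kron(I_n, L), this
    equals trace(F' L F)."""
    L = np.diag(W.sum(axis=1)) - W                      # scalar graph Laplacian
    sq_dists = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=2)
    direct = 0.5 * np.sum(W * sq_dists)                 # pairwise definition
    via_trace = np.trace(F.T @ L @ F)                   # same quantity via L
    assert np.allclose(direct, via_trace)
    return via_trace
```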
Example 3. Using the notation of Section 4.2, let M = L ⊗ I_n and K(x, z) = k(x, z) I_n, where k(x, z) is a scalar kernel and I_n is the n × n identity matrix. Then (15) reduces to multivariate Laplacian RLS (3), which is equivalent to n independent scalar Laplacian RLS solutions.

5. Matrix-valued Kernels and Numerical Implementation

Proposition 1 provides general functional solutions to Vector-valued Manifold Regularization. We now focus our attention on finite-dimensional output spaces Y = ℝ^n and in particular on solving Equation (15). We address two questions: what is an appropriate choice for a matrix-valued kernel, and how can (15) be solved more efficiently than the prohibitive O(N^3 n^3) complexity of a general dense linear system solver. For a list of vector-valued kernels considered so far in the literature, see Micchelli & Pontil (2005); Caponnetto et al. (2008); Baldassarre et al. (2010). For practical applications, in this paper we consider matrix-valued kernels of the form,

    K(x_i, x_j) = k(x_i, x_j) [γ_O L_out^† + (1 − γ_O) I_n],    (16)

where k(·, ·) is a scalar-valued kernel, 0 ≤ γ_O < 1 is a real-valued parameter, I_n is the n × n identity matrix and L_out^† is the pseudo-inverse of the normalized Laplacian matrix, L_out, of a similarity graph W_out over outputs. We assume that W_out is either available as prior knowledge or, as in Section 6, estimated from nearest neighbor graphs over the available labels.
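For instance, the output factor of the kernel (16) could be assembled as in the sketch below. This is an illustrative recipe, not the paper's code; the normalized-Laplacian convention and the small guard against isolated nodes are our assumptions.

```python
import numpy as np

def output_kernel(W_out, gamma_O):
    """Build Q = gamma_O * pinv(L_out) + (1 - gamma_O) * I_n from an n x n
    output similarity matrix W_out, where L_out is the normalized graph
    Laplacian, as in the kernel choice (16)."""
    d = W_out.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))   # guard zero degrees
    L_out = np.eye(len(d)) - D_inv_sqrt @ W_out @ D_inv_sqrt    # normalized Laplacian
    return gamma_O * np.linalg.pinv(L_out) + (1 - gamma_O) * np.eye(len(d))

# The Gram matrix of the matrix-valued kernel over N inputs is then the
# Kronecker product np.kron(G_k, Q), with G_k the scalar Gram matrix of k.
```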

We note the following properties of this kernel that justify its choice in practical problems.

Universality: Let X be a Hausdorff topological space, 0 ≤ γ_O < 1, and let k(·, ·) be a universal scalar-valued kernel, e.g., the Gaussian kernel k(x, z) = exp(−||x − z||^2 / (2σ^2)). Then H_K is universal, i.e., any continuous function from a compact subset of X to Y can be uniformly approximated by functions in H_K. This observation follows from Theorem 12 of Caponnetto et al. (2008).

Regularization: Let f = (f_1, ..., f_n) be a vector-valued function such that each f_i ∈ H_k. When γ_O = 1, the norm of a vector-valued function in the RKHS induced by (16) is given by (Baldassarre et al., 2010),

    ||f||^2_K = (1/2) Σ_{i,j=1}^{n} ||f_i − f_j||^2_k (W_out)_ij / √(d_i d_j) + Σ_{i=1}^{n} ||f_i||^2_k,

where W_out is the adjacency matrix of a similarity graph over outputs, d_i = Σ_j (W_out)_ij, and ||·||_k is the norm in the scalar-valued RKHS induced by k(·, ·). When γ_O = 0, K becomes diagonal and reduces to multivariate Laplacian RLS, Eq. (3), where each output is estimated independently of the others. Thus, γ_O parameterizes the degree to which output inter-dependencies are enforced in the model.

Computational Consequences: The Gram matrix of the vector-valued kernel is the Kronecker product, G = G_k^N ⊗ Q, where G_k^N is the Gram matrix of the scalar kernel over labeled and unlabeled data and Q = γ_O L_out^† + (1 − γ_O) I_n. Using M = L ⊗ I_n as in Section 4.3, the linear system (15) becomes the following,

    [J_nl^{Nn} (G_k^N ⊗ Q) + lγ_I (L ⊗ I_n)(G_k^N ⊗ Q) + lγ_A I_{Nn}] a = y.

Applying only basic properties of Kronecker products, it can be shown that this is identical to solving the following Sylvester Equation for the matrix A, where a = vec(A^T),

    −(1/(lγ_A)) (J_l^N G_k^N + lγ_I L G_k^N) A Q − A + (1/(lγ_A)) Y = 0.    (17)

This Sylvester equation can be solved much more efficiently than directly calling a dense linear solver for (15). The number of floating point operations of a popular implementation (Golub et al., 1979) is (5/3)N^3 + 10n^3 + 5N^2 n + 2.5Nn^2. Note that this complexity is of the same order as that of (non-linear) RLS with N (or n) labeled data points. Larger-scale iterative solvers are also available for such problems. (In Matlab, X = dlyap(A, B, C) solves AXB − X + C = 0.)
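One way to exploit this structure, sketched below under the assumption that Q is symmetric positive semidefinite, is to diagonalize the n × n matrix Q and decouple (17) into n independent N × N linear systems. This is an alternative to the Hessenberg-Schur routine cited above, not the paper's implementation; the function name and argument layout are ours.

```python
import numpy as np

def vvmr_solve(G_k, L_in, Q, Y, l, gamma_A, gamma_I):
    """Solve equation (17) in the equivalent form
        (J_l G_k + l*gamma_I L G_k) A Q + l*gamma_A A = Y
    by diagonalizing Q = U diag(lam) U'.  With B = A U the equation decouples
    into n systems (lam_j * M + l*gamma_A * I) B[:, j] = (Y U)[:, j].
    G_k: N x N scalar Gram matrix; L_in: N x N input graph Laplacian;
    Q: n x n output kernel; Y: N x n label matrix (zero rows for unlabeled)."""
    N = G_k.shape[0]
    J = np.diag([1.0] * l + [0.0] * (N - l))
    M = J @ G_k + l * gamma_I * (L_in @ G_k)
    lam, U = np.linalg.eigh(Q)                       # Q symmetric PSD
    YU = Y @ U
    B = np.column_stack([
        np.linalg.solve(lam[j] * M + l * gamma_A * np.eye(N), YU[:, j])
        for j in range(Q.shape[0])
    ])
    return B @ U.T                                   # N x n expansion coefficients A
```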

6. Empirical Studies

We consider semantic scene annotation and hierarchical text categorization as empirical testbeds for Vector-valued Manifold Regularization (abbreviated VVMR), implemented by solving (17). To the best of our knowledge, these are the first applications of vector-valued RKHS methods to multi-label learning problems. We empirically explore the value of incorporating unlabeled data in conjunction with dependencies across multiple output variables. A study is conducted to evaluate the quality of out-of-sample extension to unseen test data in comparison to transductive performance on the unlabeled set, and a comparison is presented against several recent state-of-the-art multi-label learning methods. For our performance evaluation metric, we compute the mean of the area under the ROC curve over all labels, as also used in Sun et al. (2011).

Semantic Scene Annotation: The data for this problem domain is drawn from Boutell et al. (2004). The task is to annotate images of natural scenes with 6 semantic categories: Beach, Sunset, Fall foliage, Field, Mountain and Urban, given a 294-dimensional feature representation of color images. Thus each image x may be associated with a multi-label y ∈ ℝ^6, where y_i = 1 indicates presence and y_i = −1 indicates absence of the i-th semantic category. We split the 2407 images available in this dataset into a training set (labeled + unlabeled) of 1204 images and a test set of 1203 images. All results reported in this section are averaged over 10 random choices of 100 labels, treating the remaining training images as unlabeled data.

RCV1 - Hierarchical Text Categorization: The data for this problem domain is drawn from Lewis et al. (2004). The task is to place Reuters news-stories into a hierarchy of categories spanning various Corporate, Government, Markets and Economics related sub-topics. We used a subset of 6000 documents split evenly between a training (labeled + unlabeled) and a test set, available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html. We eliminated words with small document and category frequencies, resulting in a 2900-dimensional dataset with 25 categories. All results reported in this section are averaged over 10 random choices of 50 labels, treating the rest of the training set as unlabeled data.

Hyperparameters: Note that model selection in semi-supervised settings is a challenging problem in its own right, and not the focus of this paper. For scene annotation, we set γ_A = 0.0001 and used RBF scalar kernels with σ = 4.3, a default value corresponding to the median of pairwise distances amongst training datapoints. We use 5 and 2 nearest neighbor graphs to construct the input and the output normalized Graph Laplacians respectively. For RCV1, we use linear scalar kernels, which are the most popular choice for text datasets. We set γ_A = 1 and used 50 and 20 nearest neighbor graphs to construct the input (iterated to degree 5) and the output normalized Graph Laplacians respectively. The fixed choices of some of these hyperparameters are based on values reported in Sindhwani et al. (2005) for image and text datasets. No further optimization of these values was attempted.
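The nearest-neighbor graphs and normalized Laplacians mentioned above can be constructed in a few lines. The sketch below uses a symmetrized binary k-NN graph, which is a common convention we adopt for illustration and not necessarily the exact construction used in these experiments.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_graph_laplacian(X, n_neighbors):
    """Symmetrized binary k-nearest-neighbor graph and its normalized
    Laplacian L = I - D^{-1/2} W D^{-1/2} for a data matrix X (rows = points)."""
    dists = cdist(X, X)                              # pairwise Euclidean distances
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(dists[i])[1:n_neighbors + 1] # skip the point itself
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)                           # symmetrize
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
```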

[Figure 1 (plots omitted): Mean AUC on the unlabeled set as a function of γ_I ∈ {0, 1e−5, 1e−4, 1e−3, 1e−2, 1e−1}, with curves ranging from γ_O = 0.0 to γ_O = 1.0. Caption: Influence of unlabeled data (γ_I) and output dependencies (γ_O): scene (top) and text (bottom) datasets.]

Performance Enhancements with Unlabeled Data and Output Dependencies: In Figure 1, we evaluate Vector-valued Manifold Regularization in predicting the unknown multi-labels of the unlabeled set as a function of γ_I and γ_O. As the influence of unlabeled data is increased (γ_I taken steadily from 0 to 0.1), on both datasets we see a significant improvement in performance for any fixed value of γ_O. This entire performance curve shows a consistent lift as γ_O is steadily increased from 0 to 1. These results show the benefit of incorporating both the data geometry and the output dependencies in a single model. On RCV1, unlabeled data shows greater relative benefit when output dependencies are weakly enforced, suggesting that it also provides robustness against imprecise model selection. Note that the special case γ_I = 0, γ_O > 0 corresponds to vector-valued RLS, while γ_I > 0, γ_O = 0 corresponds to multivariate Laplacian RLS. In general, vector-valued manifold regularization outperforms these extremes.

[Figure 2 (plots omitted): scatter of AUCs on Test Data against AUCs on Unlabeled Data. Caption: Transductive and Inductive AUCs across outputs (colored) on scene (top) and text (bottom) datasets.]

Out-of-sample Generalization: In Figure 2, we show a scatter plot of performance on the unseen test set (induction) versus the performance on unlabeled data (transduction) for each of the different labels across the 10 random choices of labeled data. In nearly all cases, we see high-quality out-of-sample generalization from the unlabeled set to the test set. Purely transductive semi-supervised multilabel methods (Chen et al., 2008; Liu et al., 2006) do not naturally provide an out-of-sample extension.

Comparisons: We compared VVMR, i.e., solving (17), with several recently proposed supervised and semi-supervised multilabel learning methods: (1) KCCA (Sun et al., 2011): a kernelized version of Canonical Correlation Analysis applied to labeled input and output data; its ridge parameter was optimized over {0, 10^k, −3 ≤ k ≤ 3}; (2) M3L (Hariharan et al., 2010): a large-scale margin-based method for multi-label learning; its C parameter was optimized in the range 10^k, −3 ≤ k ≤ 3;

(3) SMSE2 (Chen et al., 2008): a transductive method that enforces smoothness with respect to input and output graphs; its parameters (analogous to γ_A, γ_O) were optimized over the range reported in Chen et al. (2008); and (4) CNMF (Liu et al., 2006): a constrained non-negative matrix factorization approach that also transductively exploits a given input and output similarity structure; its parameters were also optimized over the ranges reported in Liu et al. (2006). Note that these semi-supervised methods do not provide an out-of-sample extension as our method naturally can. From Table 1, we see that VVMR provides statistically significant improvements (paired t-test at p = 0.05) over each of these alternatives in predicting the labels of unlabeled data. Exactly the same choices of input-output similarity graphs and scalar kernel functions were used across the different methods in this comparison.

Table 1. Comparison with competing algorithms (bold indicates statistically significant improvements).

    Dataset   Scene        RCV1
    KCCA      84.0 (0.5)   80.0 (2.3)
    M3L       81.7 (0.8)   84.3 (1.5)
    SMSE2     55.4 (0.5)   52.5 (0.7)
    CNMF      52.7 (5.5)   50.8 (1.4)
    VVMR      84.9 (2.3)   85.2 (1.4)

7. Conclusion

We consider this paper as a preliminary foray into more widespread adoption of the vector-valued RKHS formalism for structured semi-supervised problems. Our results confirm that output dependencies and data geometry can both be exploited in a complementary fashion. Our choice of the kernel was dictated by certain natural regularization properties and computational tractability. The theoretical, algorithmic and empirical study of a wider class of operator-valued kernels offers a rich avenue for future work.

References

Bakir, G.H., Hofmann, T., Schölkopf, B., Smola, A.J., Taskar, B., and Vishwanathan, S.V.N. (Eds.). Predicting Structured Data. MIT Press, 2007.

Baldassarre, L., Rosasco, L., Barla, A., and Verri, A. Vector-field learning with spectral filtering. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2010.

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

Boutell, M.R., Luo, J., Shen, X., and Brown, C.M. Learning multi-label scene classification. Pattern Recognition, 2004.

Brown, P.J. and Zidek, J.V. On adaptive multivariate regression. Annals of Statistics, 1980.

Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

Caponnetto, A., Pontil, M., Micchelli, C., and Ying, Y. Universal multi-task kernels. Journal of Machine Learning Research, 9:1615–1646, 2008.

Carmeli, C., De Vito, E., and Toigo, A. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4:377–408, 2006.

Chapelle, O., Schölkopf, B., and Zien, A. (Eds.). Semi-supervised Learning. MIT Press, 2006.

Chen, G., Song, Y., Wang, F., and Zhang, C. Semi-supervised multi-label learning by solving a Sylvester equation. In SIAM Conference on Data Mining (SDM), 2008.

Golub, G.H., Nash, S., and Van Loan, C.F. A Hessenberg-Schur method for the problem AX + XB = C. IEEE Transactions on Automatic Control, AC-24:909–913, 1979.

Hariharan, B., Zelnik-Manor, L., Vishwanathan, S.V.N., and Varma, M. Large scale max-margin multi-label classification with priors. In Proceedings of the International Conference on Machine Learning, 2010.

Lewis, D.D., Yang, Y., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

Liu, Y., Jin, R., and Yang, L. Semi-supervised multi-label learning by constrained non-negative matrix factorization. In The Twentieth Conference on Artificial Intelligence (AAAI), 2006.

Micchelli, C.A. and Pontil, M. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.

Schölkopf, B. and Smola, A. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge, 2002.

Sindhwani, V., Niyogi, P., and Belkin, M. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings of the International Conference on Machine Learning, 2005.

Sun, L., Ji, S., and Ye, J. Canonical correlation analysis for multilabel classification: A least squares formulation, extensions and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:194–200, 2011.