Multi-view Vector-valued Manifold Regularization for Multi-label Image Classification

Yong Luo, Dacheng Tao, Senior Member, IEEE, Chang Xu, Chao Xu, Hong Liu, Yonggang Wen, Member, IEEE

Abstract—In computer vision, image datasets used for classification are naturally associated with multiple labels and comprised of multiple views, because each image may contain several objects (e.g. pedestrian, bicycle and tree) and is properly characterized by multiple visual features (e.g. color, texture and shape). Currently available tools ignore either the label relationship or the complementarity of views. Motivated by the success of the vector-valued function that constructs matrix-valued kernels to explore the multi-label structure in the output space, we introduce multi-view vector-valued manifold regularization (MV3MR) to integrate multiple features. MV3MR exploits the complementary property of different features and discovers the intrinsic local geometry of the compact support shared by different features under the theme of manifold regularization. We conducted extensive experiments on two challenging, but popular datasets, PASCAL VOC'07 (VOC) and MIR Flickr (MIR), and validated the effectiveness of the proposed MV3MR for image classification.

Index Terms—Image classification, semi-supervised, multi-label, multi-view, manifold

I. INTRODUCTION

A natural image can be summarized by several keywords (or labels). To conduct image classification by directly using binary classification methods [1], [2], it is necessary to assume that labels are independent, although most labels appearing in one image are related to one another. Examples are given in Fig. 1, where A1-A3 show a "person" riding a "motorbike", B1-B3 indicate that "sea" usually co-occurs with "sky", and C1-C3 show some "clouds" in the "sky". This multi-label nature makes image classification intrinsically different from simple binary classification.

Moreover, different labels cannot be properly characterized by a single feature representation. For example, the color information (e.g. color histogram), shape cue (encoded in SIFT [3]) and global structure (e.g. GIST [4]) can effectively represent natural substances (e.g. sky, cloud and plant life), man-made objects (e.g. aeroplane, motorbike and TV-monitor) and scenes (e.g. seaside and indoor), respectively, but cannot simultaneously illustrate all these concepts in an effective way. Each visual feature encodes a particular property of images and characterizes a particular concept (label), so we treat each feature representation as a particular view for characterizing images. Fig. 1 (a)-(c) indicate that the SIFT representation is effective in describing a motorbike and GIST can capture the global structure of a person on the motorbike. Fig. 1 (d)-(f) show that GIST performs well in recognizing seaside scenes, while the color information can be used as a complementary aid for recognizing the blue sea water. From Fig. 1 (g)-(i) we can see that RGB usually represents clouds well and GIST is helpful when RGB fails. For example, the RGB representations of C1 and C3 are not very similar, while their GIST distance (0.22) is very small due to the sky scene structure. This multi-view nature distinguishes image classification from single-view tasks, such as texture segmentation [5] and face recognition [6].

The vector-valued function [7] has recently been introduced to resolve multi-label classification [8] and has been demonstrated to be effective in semantic scene annotation. This method naturally incorporates the label dependencies into the classification model by first computing the graph Laplacian [9] of the output similarity graph, and then using this graph to construct a vector-valued kernel. This model is superior to most of the existing multi-label learning methods [10], [11], [12] because it naturally considers the label correlations and efficiently outputs all the predicted labels at one time.
Although the vector-valued function is effective for general multi-label classification tasks, it cannot directly handle image classification problems in which images are represented by multi-view features. A popular solution is to concatenate all the features into a long vector. This concatenation strategy, however, not only ignores the physical interpretations of different features, but also encounters the over-fitting problem given limited training samples.

We thus introduce multi-kernel learning (MKL) into the vector-valued function and present a multi-view vector-valued manifold regularization (MV3MR) framework for handling the multi-view features in multi-label image classification. MV3MR associates each view with a particular kernel, assigns a higher weight to the view/kernel carrying more discriminative information, and explores the complementary nature of different views.

Y. Luo, C. Xu, and C. Xu are with the Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China.
D. Tao is with the Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, Jones Street, Ultimo, NSW 2007, Sydney, Australia.
H. Liu is with the Engineering Lab on Intelligent Perception for Internet of Things, Shenzhen Graduate School, Peking University, China (Email: [email protected]).
Y. Wen is with the Division of Networks and Distributed Systems, School of Computer Engineering, Nanyang Technological University, Singapore (Email: [email protected]).
(c) 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

[Fig. 1 graphic: panels (a)-(i) show, for each image group A1-A3, B1-B3 and C1-C3, the SIFT, GIST and RGB feature histograms together with the corresponding pairwise distance matrices; see the caption below.]

Fig. 1. A1-A3, B1-B3 and C1-C3 are images of a person riding a motorbike, the seaside, and clouds in the sky, respectively. Each of (a)-(i) contains the feature representations and a distance matrix of the samples in a particular view. All the distances have been normalized.

In particular, MV3MR assembles the multi-view information through a large number of unlabeled images to discover the intrinsic geometry embedded in the high-dimensional ambient space of the compact support of the marginal distribution. The local geometry, approximated by the adjacency graphs induced from multiple kernels of all the corresponding views, is more reliable than that approximated by the adjacency graph induced from the kernel of any particular view. In this way, MV3MR essentially improves the vector-valued function for multi-label image classification.

Because the hinge loss is more suitable for classification than the least squares loss [13], we derive an SVM (support vector machine) formulation of MV3MR, which results in a multi-view vector-valued Laplacian SVM (MV3LSVM). We carefully design the MV3LSVM algorithm so that it determines the set of kernel weights in the learning process of the vector-valued function.

We thoroughly evaluate the proposed MV3LSVM algorithm on two challenging datasets, PASCAL VOC'07 [14] and MIR Flickr [15], by comparing it with a popular MKL algorithm [16], a recently proposed MKL method [17], and competitive multi-label learning algorithms for image classification, such as multi-label compressed sensing [18], canonical correlation analysis [12] and vector-valued manifold regularization [8], in terms of mean average precision (mAP), mean area under the ROC curve (mAUC) and ranking loss (RL). The experimental results suggest the effectiveness of MV3LSVM.

The rest of the paper is organized as follows. Section II summarizes the recent work in multi-label learning, multi-kernel learning and image classification. In Section III, we introduce manifold regularization and its vector-valued generalization. We depict the proposed MV3MR framework and its SVM formulation in Section IV. Extensive experiments are presented in Section V and we conclude this paper in Section VI.

II. RELATED WORK

A. Multi-label learning

Multi-label classification has received intensive attention in recent years. Some methods extend traditional multi-class algorithms to cope with the multi-label problem. AdaBoost.MH [19] adds the label value to the feature vector and then applies AdaBoost on weak classifiers. A ranking algorithm is presented in [20] by adopting the ranking loss as the cost function in SVM. ML-KNN [21] is an extension of the k-nearest neighbor (KNN) algorithm to deal with multi-label data, and canonical correlation analysis (CCA) has also recently been extended to the multi-label case by formulating it as a least-squares problem [12].

Other works concentrate on preprocessing the data so that standard binary or multi-class techniques can be utilized. For example, the multiple labels of a sample belong to a subset of the whole label set, and we can view this subset as a new class [22]. This may lead to a large number of classes, and a more common strategy is to learn a binary classifier for each label [1], [2]. Considering that the labels are often sparse, a compressed sensing method has been proposed for multi-label prediction [18].

Various approaches have been proposed to improve prediction accuracy by exploiting label correlations [23], [24], [11], [8]. Sun et al. [23] proposed the construction of a hypergraph to exploit the label dependencies. In [24], a common subspace is assumed to be shared among all labels, and the correlation information contained in different labels can be captured by learning this low-dimensional subspace. A max-margin method is proposed in [11], where the prior knowledge of the label correlations is incorporated explicitly in the multi-label classification model.

None of the approaches mentioned above consider the features to be used; however, an image with multiple labels usually indicates that it contains multiple objects. As far as we know, there is no single kind of feature that can describe a variety of objects very well. Therefore, how to combine different features is a critical issue in multi-label image classification, and we consider MKL for this purpose in this paper.

B. Multi-kernel learning

Classical kernel methods are usually based on a single kernel [25]. MKL [26], in which a kernel-based classifier and a convex combination of the kernels are learned simultaneously, has attracted much attention.

Lanckriet et al. [26] introduce MKL for binary classification and solve it with semi-definite programming (SDP) techniques. The MKL problem is further developed by Sonnenburg et al. [27] through the presentation of a semi-infinite linear program (SILP). In [16], MKL is reformulated by using a weighted L2-norm regularization to replace the mixed-norm regularization and adding an L1-norm constraint on the kernel weights. All of these MKL formulations are based on SVM and are not naturally designed for multi-label classification. The proposed MV3MR framework extends MKL to handle the multi-label problem and model label inter-dependencies.

C. Image classification

Image classification has been widely used in many computer vision-related applications such as image retrieval and web content browsing. In recent years, more than a dozen methods have been proposed, and representative works can be grouped into three categories.

• Single-view learning for image classification: This category contains many recent image classification schemes, e.g. dictionary learning [28] and spatial pyramid matching [29]. For example, Labusch et al. [30] proposed to integrate sparse coding and a local maximum operation to extract local features for handwritten digit recognition. In [31], a non-linear coding scheme was introduced for local descriptors such as SIFT. Yang et al. [32] explored the local co-occurrences of visual words over the spatial pyramid.

• Multi-view learning for image classification: Schemes in this category utilize the features from different views (or multi-view features) to boost image classification performance. In this paper, the concept of "views" used for learning refers to different features or attributes that depict the objects to be classified. It should be noted that for some other applications in vision and graphics, "views" mean different spatial viewpoints [33], [34], [35]. A semi-supervised boosting algorithm is proposed in [36], in which images measured by different views are used to construct a prior and formulate a regularization term. Guillaumin et al. [2] combined 15 visual representations (e.g. SIFT, GIST and HSV) with the tag feature for semi-supervised image classification. Combining visual and textual information has also been utilized for clustering [37] and web page classification [38].

• Multi-label learning for image classification: This category is motivated by the success of multi-label learning and has demonstrated promising image classification performance. For example, Bucak et al. [39] proposed a ranking-based algorithm to tackle the multi-label problem with incompletely labeled data by introducing a group lasso regularizer in the optimization. Unlike traditional multi-label methods that always consider positive label correlations, a novel approach is presented in [40] to make use of the negative relationships between categories.

Although it has been widely acknowledged that both multi-view representation and label inter-dependencies are important for multi-label image classification, most of the existing approaches do not take both of them into consideration. Most existing multi-view approaches assume that different views (features) contribute equally to label prediction. In contrast to these approaches, the proposed MV3MR naturally explores both the complementary property of multi-view features and the correlations of different labels under the manifold regularization scheme.

III. MANIFOLD REGULARIZATION AND VECTOR-VALUED GENERALIZATION

This section briefly introduces the manifold regularization framework [9] and its vector-valued generalization [8]. Given a set of l labeled examples D_l = {(x_i, y_i)}_{i=1}^l and a relatively large set of u unlabeled examples D_u = {x_i}_{i=l+1}^{N=l+u}, we consider a non-parametric estimation of a vector-valued function f : X → Y, where Y = R^n and n is the number of labels. This setting includes Y = R as a special case for regression and classification.

A. Manifold regularization

Manifold learning has been widely used for capturing the local geometry [41] and conducting low-dimensional embedding [42], [43]. In manifold regularization, the data manifold is characterized by a nearest neighbor graph W, which explores the geometric structure of the compact support of the marginal distribution. The Laplacian L of W and the prediction f = [f(x_1), ..., f(x_N)] are then formulated as a smoothness constraint ‖f‖_I^2 = f^T L f, where L = D − W and the diagonal matrix D is given by D_ii = Σ_{j=1}^N W_ij. The manifold regularization framework minimizes the regularized loss

argmin_{f ∈ H_k} (1/l) Σ_{i=1}^l L(f, x_i, y_i) + γ_A ‖f‖_k^2 + γ_I ‖f‖_I^2,   (1)

where L is a predefined loss function, k is a standard scalar-valued kernel, i.e. k : X × X → R, and H_k is the associated reproducing kernel Hilbert space (RKHS). Here, γ_A and γ_I are trade-off parameters that control the complexities of f in the ambient space and on the compact support of the marginal distribution. The Representer theorem [9] ensures that the solution of problem (1) takes the form f*(x) = Σ_{i=1}^N α_i k(x, x_i), where α_i ∈ R is a coefficient. Since a pair of close samples means that the corresponding conditional distributions are similar, the manifold regularization ‖f‖_I^2 helps the function learning.

B. Vector-valued manifold regularization

In the vector-valued RKHS, where a kernel function K is defined and the corresponding Y-valued RKHS is denoted by H_K, the optimization problem of the vector-valued manifold regularization (VVMR) is given by

argmin_{f ∈ H_K} (1/l) Σ_{i=1}^l L(f, x_i, y_i) + γ_A ‖f‖_K^2 + γ_I ⟨f, Mf⟩_{Y^{u+l}},   (2)

where Y^{u+l} is the (u+l)-fold direct product of Y and the inner product takes the form

⟨(y_1, ..., y_{u+l}), (w_1, ..., w_{u+l})⟩_{Y^{u+l}} = Σ_{i=1}^{u+l} ⟨y_i, w_i⟩_Y.
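For concreteness, the smoothness term in (1)-(2) can be illustrated with a small numpy sketch that builds a k-nearest-neighbor graph, its Laplacian L = D − W, and the vector-valued operator L ⊗ I_n used as M below. This is only an illustration under simple binary edge weights, not the authors' implementation.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Build a symmetric k-NN adjacency W and its graph Laplacian L = D - W.

    X: (N, d) feature matrix for one view. Binary edge weights are used here for
    simplicity; heat-kernel weights are a common alternative.
    """
    N = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(d2[i])[1:k + 1]          # skip the point itself
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)                        # symmetrize
    L = np.diag(W.sum(axis=1)) - W                # L = D - W
    return W, L

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 10))                 # toy single-view features
    f = rng.normal(size=20)                       # toy scalar prediction vector
    _, L = knn_graph_laplacian(X)
    n_labels = 3
    M = np.kron(L, np.eye(n_labels))              # vector-valued Laplacian L ⊗ I_n used in (2)
    print("smoothness f^T L f =", f @ L @ f)
```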

[Fig. 2 and Fig. 3 graphics: Gram-matrix and pipeline diagrams; only the captions are reproduced here.]

Fig. 2. Different views contribute to the classification differently. G^k_1 is a Gram matrix constructed from SIFT [3]. G^k_2 is obtained from the RGB color histogram. G^k_1 and G^k_2 are complementary to each other. The learned linear combination G^k_comb of the two Gram matrices is closer to the optimal Gram matrix G^k_opt than the mean Gram matrix G^k_avg obtained by simply averaging the two kernels.

Fig. 3. The diagram of the proposed MV3MR algorithm. The given labels are used to construct an output similarity graph, which encodes the label correlations. Features from different views of the labeled and unlabeled data are used to construct different Gram matrices (with label correlations incorporated) G_v, v = 1, ..., V, as well as the different graph Laplacians M_v, v = 1, ..., V. We learn the weight β_v for G_v and θ_v for M_v respectively. The combined Gram matrix G is used for classification while preserving locality on the integrated manifold M.

The function prediction f = [f(x_1), ..., f(x_{u+l})] ∈ Y^{u+l}. The matrix M is a symmetric positive operator that satisfies ⟨y, My⟩ ≥ 0 for all y ∈ Y^{u+l} and is chosen to be L ⊗ I_n. Here, L is the graph Laplacian, I_n is the n×n identity matrix and ⊗ denotes the Kronecker (tensor) matrix product.

For Y = R^n, an entry K(x_i, x_j) of the n×n vector-valued kernel matrix is defined by

K(x_i, x_j) = k(x_i, x_j) (γ_O L_out^† + (1 − γ_O) I_n),   (3)

where k(·,·) is a scalar-valued kernel and γ_O ∈ [0, 1] is a parameter. Here, L_out^† is the pseudo-inverse of the output labels' graph Laplacian. The output graph can be estimated by viewing each label as a vertex and using the nearest neighbors method. The representation of the j-th label is the j-th column of the label matrix Y ∈ R^{N×n}, in which Y_ij = 1 if the j-th label is manually assigned to the i-th sample, and −1 otherwise. For the unlabeled samples, Y_ij = 0.

It has been proved in [8] that the solution of the minimization problem (2) takes the form f*(x) = Σ_{i=1}^N K(x, x_i) a_i. By choosing the regularized least squares (RLS) loss L(f, x_i, y_i) = (f(x_i) − y_i)^2, we can estimate the column vector a = {a_1, ..., a_{u+l}} ∈ R^{n(u+l)}, with each a_i ∈ Y, by solving a Sylvester equation:

−(1/(lγ_A)) (J_l^N G^k + lγ_I L G^k) A Q − A + (1/(lγ_A)) Y = 0,   (4)

where a = vec(A^T), Q = γ_O L_out^† + (1 − γ_O) I_n, and J_l^N is a diagonal matrix with the first l entries 1 and the others 0. Here, G^k is the Gram matrix of the scalar-valued kernel k over the labeled and unlabeled data. We refer to [8] for a detailed description of the vector-valued Laplacian RLS.
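The output operator Q = γ_O L_out^† + (1 − γ_O) I_n in (3) and the resulting vector-valued Gram matrix G = G^k ⊗ Q can be written down directly; the following numpy sketch also solves the linear system behind (4) by vectorization, which is only practical for small N and n. The rearrangement C A Q + A = Y/(lγ_A), with C = (J_l^N G^k + lγ_I L G^k)/(lγ_A), is ours, and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def output_operator(L_out, gamma_o=1.0):
    """Q = gamma_o * pinv(L_out) + (1 - gamma_o) * I_n, as in Eq. (3)."""
    n = L_out.shape[0]
    return gamma_o * np.linalg.pinv(L_out) + (1.0 - gamma_o) * np.eye(n)

def vector_valued_gram(G_scalar, Q):
    """Vector-valued Gram matrix with blocks K(x_i, x_j) = k(x_i, x_j) * Q, i.e. G = G_scalar kron Q."""
    return np.kron(G_scalar, Q)

def solve_vvrls(G_scalar, L_in, L_out, Y, l, gamma_a, gamma_i, gamma_o=1.0):
    """Solve C A Q + A = Y / (l * gamma_a), a rearranged form of Eq. (4), by vectorization.

    G_scalar: (N, N) scalar Gram matrix, L_in: (N, N) input graph Laplacian,
    L_out: (n, n) output graph Laplacian, Y: (N, n) label matrix.
    """
    N, n = Y.shape
    Q = output_operator(L_out, gamma_o)
    J = np.diag(np.r_[np.ones(l), np.zeros(N - l)])            # J_l^N
    C = (J @ G_scalar + l * gamma_i * L_in @ G_scalar) / (l * gamma_a)
    D = Y / (l * gamma_a)
    lhs = np.kron(Q.T, C) + np.eye(N * n)                       # (Q^T kron C + I) vec(A) = vec(D)
    A = np.linalg.solve(lhs, D.flatten(order="F")).reshape(N, n, order="F")
    return A
```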

IV. MV3MR: MULTI-VIEW VECTOR-VALUED MANIFOLD REGULARIZATION

To handle multi-view multi-label image classification, we generalize VVMR and present multi-view vector-valued manifold regularization (MV3MR). In contrast to [2], which assumes that different views contribute equally to the classification, MV3MR assumes that different views contribute to the classification differently and learns the combination coefficients to integrate different views.

Fig. 2 gives an illustrative example which suggests that different views contribute to the classification differently, and that learning the combination coefficients to integrate different views benefits the classification. Given five images from two classes, namely three cars of different colors (silvery white, blue and red) and two different sky images, the optimal Gram matrix G^k_opt for separating these images into two classes is shown on the right side. On the left, there are four Gram matrices: two single Gram matrices G^k_1 and G^k_2 obtained from two different views, their mean G^k_avg, and their linear combination G^k_comb with the learned coefficients. The figure indicates that G^k_comb is closer to the optimal Gram matrix G^k_opt than G^k_avg.

Given a small number of labeled samples and a relatively large number of unlabeled samples, MV3MR first computes an output similarity graph by using the label information of the labeled samples. The Laplacian of the label graph is incorporated in the scalar-valued Gram matrix G^k_v over the labeled and unlabeled data to enforce label correlations on each view, and the vector-valued Gram matrices G_v = G^k_v ⊗ Q, v = 1, ..., V, can be obtained. Meanwhile, we also compute the vector-valued graph Laplacians M_v, v = 1, ..., V, by using the features of the input data from different views. MV3MR then learns the kernel combination coefficient β_v for G_v as well as the graph weight θ_v for M_v by the use of alternating optimization. Finally, the combined Gram matrix G, together with the regularization on the combined manifold M, is used for classification. Fig. 3 summarizes the above procedure. Technical details are given below.
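As a sketch of the per-view construction just described, the following numpy fragment forms G_v = G^k_v ⊗ Q and M_v = L_v ⊗ I_n for every view and combines them with the weights β and θ. It assumes the scalar Gram matrices, per-view input Laplacians and the output operator Q are already available; the function names are ours.

```python
import numpy as np

def build_view_matrices(G_scalar_list, L_in_list, Q, n_labels):
    """For every view v, form the vector-valued Gram matrix G_v = G^k_v kron Q and
    the vector-valued graph Laplacian M_v = L_v kron I_n."""
    G_views = [np.kron(Gk, Q) for Gk in G_scalar_list]
    M_views = [np.kron(Lv, np.eye(n_labels)) for Lv in L_in_list]
    return G_views, M_views

def combine_views(G_views, M_views, beta, theta):
    """Combined Gram matrix G = sum_v beta_v G_v and integrated Laplacian M = sum_v theta_v M_v,
    with beta and theta lying on the simplex (uniform weights 1/V are used for initialization)."""
    G = sum(b * Gv for b, Gv in zip(beta, G_views))
    M = sum(t * Mv for t, Mv in zip(theta, M_views))
    return G, M
```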

A. Rationality

Let V be the number of views and v be the view index. On the feature space of each view, we define the corresponding positive definite scalar-valued kernel k_v, which is associated with an RKHS H_{k_v}. It follows from the functional framework [16] that, by introducing a non-negative coefficient β_v, the Hilbert space H'_{k_v} = {f | f ∈ H_{k_v} : ‖f‖_{H_{k_v}} / β_v < ∞} is an RKHS with kernel k'(x, x') = β_v k_v(x, x'). If we define H_k as the direct sum of the spaces H'_{k_v}, i.e. H_k = ⊕_{v=1}^V H'_{k_v}, then H_k is an RKHS associated with the kernel

k(x, x') = Σ_{v=1}^V β_v k_v(x, x').   (5)

Thus, any function in H_k is a sum of functions belonging to the H'_{k_v}. The vector-valued kernel K(x, x') = k(x, x') ⊗ Q = Σ_{v=1}^V β_v K_v(x, x'), where we have used the bilinearity of the Kronecker product. Each K_v(x, x') = k_v(x, x') ⊗ Q corresponds to an RKHS according to the study of RKHSs for vector-valued functions [8]. Thus, the kernel K is associated with an RKHS H_K. This functional framework motivates the MV3MR framework. We will jointly learn the linear combination coefficients {β_v}, which integrate the kernels characterizing the different views, and the classifier coefficients {a_i} in a single optimization problem. Moreover, to effectively utilize the unlabeled data, we construct graph Laplacians for the different views and learn to combine all of them.

B. Problem formulation

Under the multi-view setting and the theme of manifold regularization, we propose to learn the vector-valued function f by linearly combining the kernels and graphs from different views. The optimization problem is given by

argmin_{f ∈ H_K} (1/l) Σ_{i=1}^l L(f, x_i, y_i) + γ_A ‖f‖_K^2 + γ_I ⟨f, Mf⟩_{Y^{u+l}} + γ_B ‖β‖_2^2 + γ_C ‖θ‖_2^2
s.t. Σ_{v=1}^V β_v = 1, β_v ≥ 0, and Σ_{v=1}^V θ_v = 1, θ_v ≥ 0, v = 1, ..., V,   (6)

where β = [β_1, ..., β_V]^T and θ = [θ_1, ..., θ_V]^T. Both γ_B > 0 and γ_C > 0 are trade-off parameters. The decision function takes the form f(x) + b = Σ_v f^v(x) + b and belongs to an RKHS H_K associated with the kernel K(x, x') = Σ_v β_v K_v(x, x'). We define M = Σ_v θ_v M_v, where each M_v is a vector-valued graph Laplacian constructed on H_{K_v}. It can be demonstrated that M is still a graph Laplacian.

Lemma 1: M ∈ S^+_{Nn} is a vector-valued graph Laplacian.

The notation S^+_n denotes the set of n×n symmetric positive semi-definite matrices, and we use S^*_n to denote the set of positive definite matrices. We then have the following version of the Representer Theorem.

Theorem 1: For fixed sets of {β_v} and {θ_v}, the minimizer of problem (6) admits an expansion

f*(x) = Σ_{i=1}^{u+l} K(x, x_i) a_i,   (7)

where a_i ∈ Y, 1 ≤ i ≤ N = u + l, are vectors to be estimated and K(x, x_i) = Σ_{v=1}^V β_v K_v(x, x_i). The proofs of Lemma 1 and Theorem 1 are detailed in the appendix.

The hinge loss L(f, x_i, y_i) = (1 − y_i f(x_i))_+ is more suitable for classification than the least squares loss, since the hinge loss results in a better convergence rate and usually higher classification accuracy; we refer to [13] for a comparison of different popular loss functions. We adopt the hinge loss in MV3MR and derive MV3LSVM as follows.

C. Multi-view vector-valued Laplacian SVM

Under the SVM formulation, the minimization problem of MV3MR is

argmin_{f ∈ H_K, β, θ} (1/(nl)) Σ_{i=1}^l Σ_{j=1}^n (1 − y_ij f_j(x_i))_+ + γ_A ‖f‖_K^2 + γ_I ⟨f, Mf⟩_{Y^{u+l}} + γ_B ‖β‖_2^2 + γ_C ‖θ‖_2^2
s.t. Σ_{v=1}^V β_v = 1, β_v ≥ 0, and Σ_{v=1}^V θ_v = 1, θ_v ≥ 0, ∀v.   (8)

An unregularized bias b_j is often added to the solution f_j(x) = Σ_{i=1}^N K_j(x, x_i) a_i in the SVM formulation. By substituting (7) into the above formulation, we obtain the primal problem as follows:

argmin_{a, b, ξ, β, θ} (1/(nl)) Σ_{i=1}^l Σ_{j=1}^n ξ_ij + γ_A a^T G a + γ_I a^T GMG a + γ_B ‖β‖_2^2 + γ_C ‖θ‖_2^2
s.t. y_ij (Σ_{z=1}^{l+u} K_j(x_i, x_z) a_z + b_j) ≥ 1 − ξ_ij, ξ_ij ≥ 0, ∀i, j,
Σ_{v=1}^V β_v = 1, β_v ≥ 0, and Σ_{v=1}^V θ_v = 1, θ_v ≥ 0, ∀v,   (9)

where G = Σ_{v=1}^V β_v G_v is the combined vector-valued Gram matrix over the labeled and unlabeled samples defined on the kernel K, and M = Σ_{v=1}^V θ_v M_v is the integrated vector-valued graph Laplacian. Here, K_j(·,·) is the j-th row of the vector-valued kernel K. We have three variables, i.e., a, β and θ, to be optimized in (9). To solve this problem, we consider the following constrained optimization problem:

min F(β, θ)
s.t. Σ_{v=1}^V β_v = 1, β_v ≥ 0, and Σ_{v=1}^V θ_v = 1, θ_v ≥ 0, ∀v,   (10)

where F(β, θ) equals

argmin_{a, b, ξ} (1/(nl)) Σ_{i=1}^l Σ_{j=1}^n ξ_ij + γ_A a^T G a + γ_I a^T GMG a + γ_B ‖β‖_2^2 + γ_C ‖θ‖_2^2
s.t. y_ij (Σ_{z=1}^{l+u} K_j(x_i, x_z) a_z + b_j) ≥ 1 − ξ_ij, ξ_ij ≥ 0, i = 1, ..., l, j = 1, ..., n.   (11)

Here, G and M take the form given in (9). We can omit the terms γ_B ‖β‖_2^2 and γ_C ‖θ‖_2^2 in (11) since β and θ are fixed. By introducing the Lagrange multipliers µ_ij and η_ij in (11), we have

W(a, ξ, b, µ, η) = (1/(nl)) Σ_{i=1}^l Σ_{j=1}^n ξ_ij + (1/2) a^T (2γ_A G + 2γ_I GMG) a − Σ_{i=1}^l Σ_{j=1}^n η_ij ξ_ij − Σ_{i=1}^l Σ_{j=1}^n µ_ij [ y_ij (Σ_{z=1}^{l+u} K_j(x_i, x_z) a_z + b_j) − 1 + ξ_ij ].   (12)

By taking the partial derivatives w.r.t. ξ_ij and b_j, and setting them to zero, we obtain

∂W/∂b_j = 0 ⇒ Σ_{i=1}^l µ_ij y_ij = 0, j = 1, ..., n,
∂W/∂ξ_ij = 0 ⇒ 1/(nl) − µ_ij − η_ij = 0 ⇒ 0 ≤ µ_ij ≤ 1/(nl).

A reduced Lagrangian can be obtained by substituting the above equalities back into (12), which leads to

W^R(a, µ) = (1/2) a^T (2γ_A G + 2γ_I GMG) a − a^T G J^T Y_d µ + µ^T 1
s.t. Σ_{i=1}^l µ_ij y_ij = 0, j = 1, ..., n,  0 ≤ µ_ij ≤ 1/(nl), i = 1, ..., l, j = 1, ..., n,   (13)

where J = [I_{nl} 0] ∈ R^{(nl)×(nl+nu)} and I_{nl} is an nl×nl identity matrix. Here, µ = {µ_1, ..., µ_l} ∈ R^{nl} is a column vector with each µ_i = [µ_{i1}, ..., µ_{in}]^T, Y_d = diag(y_11, ..., y_1n, ..., y_l1, ..., y_ln), and 1 is an all-ones column vector. Taking the partial derivative of W^R w.r.t. a and setting it to zero leads to

a* = (2γ_A I + 2γ_I MG)^{-1} J^T Y_d µ*.   (14)

Substituting it back into (13), we get

µ* = argmax_{µ ∈ R^{nl}} µ^T 1 − (1/2) µ^T S µ
s.t. Σ_{i=1}^l µ_ij y_ij = 0, j = 1, ..., n,  0 ≤ µ_ij ≤ 1/(nl), i = 1, ..., l, j = 1, ..., n,   (15)

where the matrix S = Y_d J G (2γ_A I + 2γ_I MG)^{-1} J^T Y_d. Again, the combined Gram matrix G = Σ_{v=1}^V β_v G_v and the integrated graph Laplacian M = Σ_{v=1}^V θ_v M_v. Because of the strong duality, the objective value of problem (11) is also the objective value of (13), which is W^R(a*, µ*). Therefore, we can rewrite (10) as

W(β, θ) = W^R(a*, µ*) + γ_B ‖β‖_2^2 + γ_C ‖θ‖_2^2
s.t. Σ_{v=1}^V β_v = 1, β_v ≥ 0; Σ_{v=1}^V θ_v = 1, θ_v ≥ 0, ∀v.   (16)

For fixed θ, the above problem can be rewritten with respect to β as

W(β) = β^T H β + γ_B ‖β‖_2^2 − h^T β
s.t. Σ_v β_v = 1, β_v ≥ 0, v = 1, ..., V,   (17)

where h = [h_1, ..., h_V]^T with each h_v = (a*)^T G_v J^T Y_d µ* − γ_A (a*)^T G_v a*, and H is a V×V matrix with entries H_ij = γ_I (a*)^T G_i M G_j a*. We can simply set the derivative of W(β) to zero and obtain β = (H + H^T + 2γ_B I)^{-1} h; the computed β is then projected onto the positive simplex to satisfy the summation and positivity constraints. However, such an approach lacks convergence guarantees and may lead to numerical problems. A coordinate descent algorithm is therefore used to solve (17). In each iteration round of the coordinate descent procedure, two elements β_i and β_j are selected to be updated while the others are fixed. By using the Lagrangian of problem (17), and considering that β_i + β_j does not change due to the constraint Σ_{v=1}^V β_v = 1, we have the following solution for updating β_i and β_j:

β_i* = [2γ_B (β_i + β_j) + (h_i − h_j) + 2t_ij] / [2(H_ii − H_ji − H_ij + H_jj) + 4γ_B],
β_j* = β_i + β_j − β_i*,   (18)

where t_ij = (H_ii − H_ji − H_ij + H_jj) β_i − Σ_k (H_ik − H_jk) β_k. The obtained β_i* or β_j* may violate the constraint β_v ≥ 0. Thus, we set

β_i* = 0, β_j* = β_i + β_j, if 2γ_B (β_i + β_j) + (h_i − h_j) + 2t_ij ≤ 0,
β_j* = 0, β_i* = β_i + β_j, if 2γ_B (β_i + β_j) + (h_j − h_i) + 2t_ji ≤ 0.

From the solution (18), we can see that the update criterion tends to assign a larger value β_i to a larger h_i and a smaller H_ii. Here, h_i = (a*)^T G_i J^T Y_d µ* − γ_A (a*)^T G_i a* and H_ii = γ_I (a*)^T G_i M G_i a* measure the discriminative ability and the performance of the i-th view. Let (a_i*, µ_i*) be the solution of the optimization problem of the i-th view, which is W^R(a, µ) with G = G_i. If all the solutions are the same, i.e., (a_1*, µ_1*) = ... = (a_V*, µ_V*) = (a*, µ*), then the objective value W^R(a_i*, µ_i*) of a discriminative view tends to be smaller than that of a non-discriminative view (we assume that all Gram matrices have been normalized). A smaller W^R(a_i*, µ_i*) corresponds to a larger h_i and a smaller H_ii, and thus our algorithm prefers discriminative views. However, the solutions (a_1*, µ_1*), ..., (a_V*, µ_V*) may not be exactly the same as (a*, µ*). Thus the learned β_i is in general, but not strictly, consistent with the performance of the i-th single view. We can see this in the experiments.

For fixed β, the problem (16) can be simplified as

W(θ) = s^T θ + γ_C ‖θ‖_2^2
s.t. Σ_v θ_v = 1, θ_v ≥ 0, v = 1, ..., V,   (19)

where s = [s_1, ..., s_V]^T with each s_v = γ_I (a*)^T G M_v G a*. Similarly, the solution of (19) can be obtained by using coordinate descent, and the criteria for updating θ_i and θ_j in an iteration round are given by

θ_i* = 0, θ_j* = θ_i + θ_j, if 2γ_C (θ_i + θ_j) + (s_j − s_i) ≤ 0,
θ_j* = 0, θ_i* = θ_i + θ_j, if 2γ_C (θ_i + θ_j) + (s_i − s_j) ≤ 0,
θ_i* = [2γ_C (θ_i + θ_j) + (s_j − s_i)] / (4γ_C), θ_j* = θ_i + θ_j − θ_i*, otherwise.   (20)
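A minimal sketch of the pairwise coordinate-descent step (18) for β is given below; the two boundary cases stated above are handled here by clipping the updated weight to [0, β_i + β_j]. The helper name and the clipping shortcut are ours.

```python
import numpy as np

def update_beta_pair(beta, h, H, i, j, gamma_b):
    """One pairwise coordinate-descent step of Eq. (18) on the simplex:
    only beta_i and beta_j change, and beta_i + beta_j is preserved."""
    s = beta[i] + beta[j]
    t_ij = (H[i, i] - H[j, i] - H[i, j] + H[j, j]) * beta[i] \
           - np.dot(H[i] - H[j], beta)                       # sum_k (H_ik - H_jk) beta_k
    num = 2 * gamma_b * s + (h[i] - h[j]) + 2 * t_ij
    den = 2 * (H[i, i] - H[j, i] - H[i, j] + H[j, j]) + 4 * gamma_b
    bi = min(max(num / den, 0.0), s)                          # clip to keep both weights non-negative
    beta = beta.copy()
    beta[i], beta[j] = bi, s - bi
    return beta
```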

We now summarize the learning procedure of the proposed multi-view vector-valued Laplacian SVM (MV3LSVM) in Algorithm 1.

Algorithm 1 The optimization procedure of the proposed MV3LSVM algorithm
Input: labeled data D_l = {(x_i^v, y_i)}_{i=1}^l and unlabeled data D_u = {x_i^v}_{i=l+1}^{N=u+l} from different views, v = 1, ..., V, where v is the view index
Algorithm parameters: γ_A, γ_I, γ_B and γ_C
Output: classifier variable a, the kernel combination coefficients {β_v}, and the graph Laplacian weights {θ_v}
1: Construct the Gram matrix G_v and the vector-valued graph Laplacian M_v for each view; set β_v = θ_v = 1/V, v = 1, ..., V; compute G = Σ_{v=1}^V β_v G_v and M = Σ_{v=1}^V θ_v M_v.
2: Iterate
3: Approximately solve for a through (14) with fixed G and M
4: Compute β by solving (17) and update the Gram matrix G
5: Compute θ by solving (19) and update the graph Laplacian M
6: Until convergence

The stopping criterion for terminating the algorithm can be the difference of the objective value W^R(a, µ) + γ_B ‖β‖_2^2 + γ_C ‖θ‖_2^2 between two consecutive steps. Alternatively, we can stop the iterations when the variations of β and θ are both smaller than a pre-defined threshold. Our implementation is based on the difference of the objective value, i.e., if the value |O_k − O_{k−1}| / |O_k − O_0| is smaller than a predefined threshold, then the iteration stops, where O_k is the objective value at the k-th iteration step.

D. Convergence analysis

In this section, we discuss the convergence of the proposed MV3LSVM algorithm. We first prove the convexity of the problems (11), (17) and (19) as follows.

Proof: The Hessian matrix of the objective function of (11) is He(a) = γ_A G + γ_I GMG. The Gram matrix G ∈ S^+_{Nn}, and we assume that G is positive definite in this paper (to enforce this property a small ridge is added to the diagonal of G). The second term is positive semi-definite since x^T GMG x = z^T M z ≥ 0 for any x and z = Gx. Here, we have used the property of the graph Laplacian M ∈ S^+_{Nn}. Then He(a) ∈ S^*_{Nn} for γ_A > 0 and problem (11) is strictly convex.

For the problem (17), the Hessian matrix is He(β) = H + γ_B I. The matrix H is symmetric since the element H_ij = γ_I a^T G_j M G_i a = H_ji. In addition, the Cholesky decomposition M = P^T P exists since M ∈ S^+_{Nn}. Let z_i = P G_i a; we then have H_ij = γ_I z_i^T z_j. Thus, H ∈ S^+_V and He(β) ∈ S^*_V for γ_B > 0. This means that (17) is also strictly convex.

Finally, it is straightforward to verify that the problem (19) is strictly convex for γ_C > 0. This completes the proof.

Now we discuss the convergence of our algorithm. Let the objective function of problem (9) be R(a, b, ξ, β, θ) and the initialized value be R(a^k, b^k, ξ^k, β^k, θ^k). Since the problem (11) is convex, we have R(a^{k+1}, b^{k+1}, ξ^{k+1}, β^k, θ^k) ≤ R(a^k, b^k, ξ^k, β^k, θ^k). We suppose that problem (9) is exactly solved, which means that the duality gap is zero. Then R(a^{k+1}, b^{k+1}, ξ^{k+1}, β^k, θ^k) = W(β^k, θ^k). For fixed θ^k, we obtain the convex problem (17), and thus we have R(a^{k+1}, b^{k+1}, ξ^{k+1}, β^{k+1}, θ^k) ≤ R(a^{k+1}, b^{k+1}, ξ^{k+1}, β^k, θ^k). Similarly, due to the convexity of problem (19), we have R(a^{k+1}, b^{k+1}, ξ^{k+1}, β^{k+1}, θ^{k+1}) ≤ R(a^{k+1}, b^{k+1}, ξ^{k+1}, β^{k+1}, θ^k). Therefore, the convergence of our algorithm is guaranteed.
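The outer loop of Algorithm 1 can be sketched as follows. The three inner solvers are passed in as callables and are not implemented here; the stopping rule uses the relative change of the objective value, as described above. This is a skeleton under our own naming, not the authors' code.

```python
import numpy as np

def mv3lsvm_outline(G_views, M_views, solve_dual, update_beta, update_theta,
                    max_iter=20, tol=1e-4):
    """Skeleton of Algorithm 1: alternate between the SVM variables and the view weights.

    solve_dual / update_beta / update_theta are placeholders for Eqs. (14)-(15),
    (17)-(18) and (19)-(20) respectively; they must be supplied by the caller.
    """
    V = len(G_views)
    beta = np.full(V, 1.0 / V)                        # step 1: uniform initialization
    theta = np.full(V, 1.0 / V)
    prev_obj = None
    for _ in range(max_iter):                         # step 2: iterate
        G = sum(b * Gv for b, Gv in zip(beta, G_views))
        M = sum(t * Mv for t, Mv in zip(theta, M_views))
        a, mu, obj = solve_dual(G, M)                 # step 3: a via (14)-(15) with G, M fixed
        beta = update_beta(a, mu, beta, G_views, M)   # step 4: solve (17), then refresh G
        theta = update_theta(a, G, M_views, theta)    # step 5: solve (19), then refresh M
        if prev_obj is not None and abs(obj - prev_obj) <= tol * abs(prev_obj):
            break                                     # step 6: stop on small objective change
        prev_obj = obj
    return a, beta, theta
```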

E. Complexity analysis

For the proposed MV3LSVM, the complexity is dominated by the time cost of computing a* in each iteration, where the computation of the matrix S in (15) involves an inversion and several multiplications of nN × nN matrices, and the time complexity is O(n^2.8 N^2.8) using the Strassen algorithm [44]. Problem (15) can be solved using a standard SVM solver with the time complexity O(n^2.3 l^2.3) according to sequential minimal optimization (SMO) [45]. The computations of β and θ are quite efficient since their dimensionality is V, which is usually very small (e.g., V = 7 in our experiments). Suppose the number of iterations is k; then the total cost of MV3LSVM is O(k(n^2.8 N^2.8 + n^2.3 l^2.3)). Considering that l < N, the time cost is O(k n^2.8 N^2.8), which is k times that of the case in which no combination coefficients (β and θ) are learned. From the experimental results shown in Section V-B, we find that k is very small, since our algorithm only needs a few iterations (around five) to converge. There is thus a balance between the time complexity and the classification accuracy. If only a limited number of unlabeled samples are selected to construct the input graph Laplacians, i.e., N = u + l is small, then the time complexity can be reduced with an acceptable performance sacrifice. In our experiments, we obtain satisfactory accuracy by setting N = 1000, and the time cost is acceptable.
V. EXPERIMENT

We validate the effectiveness of MV3LSVM on two challenging datasets, PASCAL VOC'07 (VOC) [14] and MIR Flickr (MIR) [15]. The VOC dataset contains 10,000 images labeled with 20 categories. The MIR dataset consists of 25,000 images of 38 concepts. For the PASCAL VOC'07 dataset [14], we use the standard train/test partition [14], which splits the 9,963 images into a training set of 5,011 images and a test set of 4,952 images. For the MIR Flickr dataset [15], images are randomly split into equally sized training and test sets. For both datasets, we randomly select twenty percent of the test images for validation and use the rest for testing. The parameters of all the algorithms compared in our experiments are tuned on the validation set. This means that the parameters corresponding to the best performance on the validation set are used for the transductive inference and the inductive test. From the training examples, 10 random choices of l ∈ {100, 200, 500} labeled samples are used in our experiments.

We use several visual views and the tag feature according to [2]. The visual views include SIFT features [3], local hue histograms [46], the global GIST descriptor [4] and some color histograms (RGB, HSV and LAB). The local descriptors (SIFT and hue) are computed densely on a multi-scale grid and quantized using k-means, which results in a visual word histogram for each image. Therefore, we have 7 different representations in total. We pre-compute a scalar-valued Gram matrix for each view and normalize it to unit trace. For the visual representations, the kernel is defined by k(x_i, x_j) = exp(−λ^{-1} d(x_i, x_j)), where d(x_i, x_j) denotes the distance between x_i and x_j and λ = max_{i,j} d(x_i, x_j). Following [2], we choose the L1 distance for the color histogram representations (RGB, HSV and LAB), L2 for GIST, and χ2 for the visual word histograms (SIFT and hue). For the tag features, a linear kernel k(x_i, x_j) = x_i^T x_j is constructed.
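A small sketch of this per-view kernel construction (exponential kernel on a precomputed distance matrix, normalized to unit trace); it assumes the pairwise distance matrix for a view has already been computed with the appropriate metric, and is an illustration rather than the original code.

```python
import numpy as np

def view_gram_matrix(D):
    """Per-view kernel k(x_i, x_j) = exp(-d(x_i, x_j) / lambda) with lambda = max_{i,j} d(x_i, x_j),
    followed by unit-trace normalization of the Gram matrix.

    D: (N, N) precomputed distance matrix (L1, L2 or chi-square, depending on the view).
    """
    lam = D.max()
    G = np.exp(-D / lam)
    return G / np.trace(G)
```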

Fig. 4. The mAP, mAUC and RL (left to right) performance enhancement obtained by learning the weights (β and θ) for different views: PASCAL VOC'07 (top) and MIR Flickr (bottom). MV3LSVM1: only learn β; MV3LSVM2: learn both β and θ.

A. Evaluation metrics

We use three kinds of evaluation criteria. The average precision (AP) and area under the ROC curve (AUC) are utilized to evaluate the ranking performance under each label. We also use the ranking loss (RL) to study the performance of label set prediction for each instance.

• Average Precision (AP) evaluates the fraction of samples ranked above a particular positive sample [47]. For each label, there is a ranked sequence of samples returned by the classifier. A good classifier will rank most of the positive samples higher than the negative ones. The traditional AP is defined as

AP = Σ_k P(k) / #{positive samples},

where k is a rank index of a positive sample and P(k) is the precision at the cut-off k. In this paper, we choose to use the computing method of the PASCAL VOC [14] challenge evaluation, i.e.,

AP = (1/11) Σ_r P(r),

where P(r) is the maximum precision over all recalls larger than r ∈ {0, 0.1, 0.2, ..., 1.0}. A larger value means a higher performance. In this paper, the mean AP over all labels, i.e. mAP, is reported to save space.

• Area Under the ROC Curve (AUC) evaluates the probability that a positive sample will be ranked higher than a negative one by a classifier [48]. It is computed from an ROC curve, which depicts the relative trade-offs between true positives (benefits) and false positives (costs). The AUC of a realistic classifier should be larger than 0.5. We refer to [48] for a detailed description. A larger value means a higher performance. Similar to AP, the mean AUC over all labels, i.e. mAUC, is reported.

• Ranking Loss (RL) evaluates the fraction of label pairs that are incorrectly ranked [19], [21]. Given a sample x_i and its label set Y_i, a successful classifier f(x, y) should have a larger value for y ∈ Y_i than for y ∉ Y_i. The ranking loss for the i-th sample is then defined as

RL(f, x_i) = (1 / (|Y_i| (P − |Y_i|))) |{(y_1, y_2) | f(x_i, y_1) ≤ f(x_i, y_2), y_1 ∈ Y_i, y_2 ∉ Y_i}|,

where P is the total number of labels and |·| denotes the cardinality of a set. The smaller the value, the higher the performance. The mean value over all samples is computed for evaluation.
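For reference, the two criteria with explicit formulas above (the 11-point AP and the ranking loss) can be computed as in the following sketch; mAP and the mean RL are then averages over labels and over samples respectively. This is a straightforward reading of the definitions, not the authors' evaluation code.

```python
import numpy as np

def pascal_ap(scores, labels):
    """11-point interpolated AP as in the PASCAL VOC evaluation:
    AP = (1/11) * sum over r in {0, 0.1, ..., 1.0} of the max precision at recall >= r."""
    order = np.argsort(-scores)
    tp = labels[order] > 0
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / max(tp.sum(), 1)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0

def ranking_loss(scores_row, y_row):
    """Fraction of incorrectly ordered (positive, negative) label pairs for one sample.
    scores_row: scores for all P labels of one sample; y_row: +1/-1 ground truth."""
    pos, neg = scores_row[y_row > 0], scores_row[y_row < 0]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    bad = sum(p <= q for p in pos for q in neg)
    return bad / (len(pos) * len(neg))
```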


Fig. 5. Behavior of the objective values with increasing iteration number on the two datasets. Left: PASCAL VOC'07; Right: MIR Flickr.

Fig. 6. Performance in mAP, mAUC, and RL vs. randomly initialized (β, θ). The proposed model is insensitive to different initializations of (β, θ). Left: PASCAL VOC'07; Right: MIR Flickr.

B. Performance enhancement with multi-view learning

It has been shown in [8] that VVMR performs well for transductive semi-supervised multi-label classification and can provide a high-quality out-of-sample generalization. The proposed MV3MR framework is a multi-view generalization of VVMR that incorporates the advantage of MKL for handling multi-view data. Therefore, we first evaluate the effectiveness of learning the view combination weights using the proposed multi-view learning algorithm for transductive semi-supervised multi-label classification. An out-of-sample evaluation will be presented in the next subsection. The experimental setup of the two compared methods is given as follows.

• VVLSVM: the vector-valued Laplacian SVM, which is an SVM implementation of the vector-valued manifold regularization framework that exploits both the geometry of the input data and the label correlations. We do not use the vector-valued Laplacian RLS presented in [8] for comparison because the hinge loss is more suitable for classification. The parameters γ_A and γ_I in (2) are both optimized over the set {10^i | i = −8, −7, ..., −2, −1}. We set the parameter γ_O in (3) to 1.0 since it has been demonstrated empirically in [8] that with a larger γ_O the performance is usually better. The mean of the multiple Gram matrices and of the input graph Laplacians is pre-computed for the experiments. The numbers of nearest neighbors for constructing the input and output graph Laplacians are tuned on the sets {10, 20, ..., 100} and {2, 4, ..., 20} respectively.

• MV3LSVM: an SVM implementation of the proposed MV3MR framework that combines multiple views by constructing kernels for all views and learning their weights. We tune the parameters γ_A and γ_I as in VVLSVM, and γ_O is set to 1.0. The additional parameters γ_B and γ_C are optimized over {10^i | i = −8, −7, ..., −2, −1}. We first learn only the kernel combination weights β and set the graph weights θ to be uniform (MV3LSVM1 in Fig. 4). Then we learn both β and θ in MV3LSVM2. We use 20 and 6 nearest neighbor graphs to construct the input and output normalized graph Laplacians respectively for the VOC dataset, while 30 and 8 nearest neighbor graphs are used in the experiments on MIR. We set these hyperparameters to be the same as those in VVLSVM and no further optimization was attempted.

The experimental results on the two datasets are shown in Fig. 4. We can see that learning the combination weights using our algorithm is always superior to simply using uniform weights for the different views. We also find that when the number of labeled samples increases, the improvement becomes small. This is because the multi-view learning actually helps to approximate the underlying data distribution. This approximation can be steadily improved with the increase of the number of labeled samples, and thus the significance of the multi-view learning to the approximation gradually decreases. Besides, we observe that β has more influence on the final performance overall.

We show the behavior of the objective values with increasing iteration number in Fig. 5. From the figure, we can see that only a few iterations (about five) are necessary to obtain a satisfactory solution. Thus the time complexity is only a little more than that of the VVLSVM algorithm, which justifies the performance enhancement.

Finally, our algorithm is not sensitive to different initializations, as shown in Fig. 6. In particular, we run our algorithm with 10 random choices of β and θ. We show the performance in terms of mAP, mAUC and RL on the two datasets in Fig. 6. It can be observed that the performance curves do not vary a lot with different initializations.

Fig. 7. Transductive and inductive APs across outputs on the VOC (top) and MIR (bottom) datasets.

C. Out-of-sample generalization

The second set of experiments evaluates the out-of-sample extension quality of the MV3MR framework, and the SVM implementation is utilized. Fig. 7 compares the transductive performance to the inductive performance when using l = 200 labeled samples. We show a scatter plot of the AP scores for each label on the two datasets by using 10 random choices of labeled data. We can see that our algorithm generalizes well from the unlabeled set to the unseen set. The MV3MR framework inherits a strong natural out-of-sample generalization ability that many semi-supervised multi-label methods do not naturally have [8]. Besides, most graph-based semi-supervised learning algorithms are transductive, and additional induction schemes are necessary to handle new points [49].
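Out-of-sample (inductive) prediction only requires evaluating the expansion f(x) = Σ_i K(x, x_i) a_i at a new image, with K(x, x_i) = (Σ_v β_v k_v(x, x_i)) Q. A sketch, with the array shapes stated as assumptions:

```python
import numpy as np

def predict_out_of_sample(k_new, beta, Q, A, b=None):
    """Inductive prediction f(x) = sum_i K(x, x_i) a_i for one unseen image x.

    k_new: (V, N) scalar kernel values between x and the N training points, one row per view;
    beta:  (V,) learned kernel weights; Q: (n, n) output operator from Eq. (3);
    A:     (N, n) matrix whose rows are the learned coefficients a_i; b: optional (n,) bias.
    """
    k_comb = beta @ k_new                 # combined scalar kernel values, shape (N,)
    f = Q @ (A.T @ k_comb)                # Q * sum_i k(x, x_i) a_i, shape (n,)
    return f if b is None else f + b
```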

D. Analysis of the combination coefficients in multi-view learning

[Fig. 8 graphic: bar charts of the learned weights over the seven views (DenseHue, DenseSift, Gist, Hsv, Lab, Rgb, Tag); only the caption is reproduced here.]

Fig. 8. The view combination weights β and θ learned by MV3MR, as well as the mAP of using VVLSVM for each view; Top: PASCAL VOC'07; Bottom: MIR Flickr.

In the following, we present empirical analyses of the multi-view learning procedure. In Fig. 8, we select l = 200 and present the view combination coefficients β and θ learned by MV3LSVM, together with the mAP obtained by using VVLSVM for each single view. From the results, we find that the tendencies of the kernel and graph weights are both consistent with the corresponding mAP in general, i.e., the views with a higher classification performance tend to be assigned larger weights, taking the DenseSift visual view (the 2nd view) and the tag (the last view) for example. However, a larger weight may sometimes be assigned to a less discriminative view; for example, the weight of Hsv (the 4th view) is larger than the weight of DenseSift (the 2nd view). This is mainly because the coefficient a is not optimal for every single view, for which only G_v and M_v are utilized. The learned a minimizes the optimization problem (8) by using the combined Gram matrix G and the integrated graph Laplacian M, which means that the learned vector-valued function is smooth along the combined RKHS and the integrated manifold. In this way, the proposed algorithm effectively exploits the complementary property of different views.

E. Comparisons with multi-label and multi-kernel learning algorithms

Our last set of experiments compares MV3LSVM with several competitive multi-label methods as well as a well-known MKL algorithm in predicting the unknown labels of the unlabeled data. The out-of-sample generalization ability of our method has been verified in our second set of experiments. We specifically compare MV3LSVM with the following methods on the challenging VOC and MIR datasets:

• SVM CAT: concatenating the features of all views and then running a standard SVM. The parameter C is tuned on the set {10^i | i = −1, 0, ..., 7, 8}. The time complexity is O(n l^2.3) [45].

• SVM UNI: combining the different kernels with uniform weights and then running a standard SVM. The parameter C is tuned on the set {10^i | i = −1, 0, ..., 7, 8}. The time complexity is O(n l^2.3) [45].

• MLCS [18]: a multi-label compressed sensing algorithm that takes advantage of the sparsity of the labels. We choose the label compression ratio to be 1.0 since the number of labels n is not very large. The mean of the multiple kernels from different views is pre-computed for the experiments. Suppose the length of the compressed label vector (for each sample) is r ≤ n. The training cost is then O(n l^3) if we choose the regression algorithm to be least squares [50], and the reconstruction complexity is O(l(n^3 + r n^2)) if the least angle regression (LARS) algorithm [51] is utilized. Considering that r ≤ n ≤ l in this paper, the time complexity of MLCS is O(n l^3).

• KLS CCA [12]: a least-squares formulation of kernelized canonical correlation analysis for multi-label classification. The ridge parameter is chosen from the candidate set {0, 10^i | i = −3, −2, ..., 2, 3}. The mean of the multiple Gram matrices is pre-computed to run the algorithm. According to the discussion presented in [12], the time complexity is O(n^2 l + kn(3l + 5d + 2dl)), where d is the feature dimensionality and k is the number of iterations.

• SimpleMKL [16]: a popular SVM-based MKL algorithm that determines the combination of multiple kernels by a reduced gradient descent algorithm. The penalty factor C is tuned on the set {10^i | i = −1, 0, ..., 7, 8}. We apply SimpleMKL to multi-label classification by learning a binary classifier for each label. According to Algorithm 1 presented in [16], there is an outer loop for updating the kernel weights, as well as an inner loop to determine the maximal admissible step size in the reduced gradient descent. Suppose the numbers of outer and inner iterations are k_1 and k_2 respectively; the time complexity of SimpleMKL is then approximately O(n k_1 k_2 l^2.3), where we have ignored the time cost of the SVM solver in the inner loop since it has a warm start and can be very fast [16].

• LpMKL [17]: a recently proposed MKL algorithm that extends MKL to the l_p-norm with p ≥ 1. The penalty factor C is tuned on the set {10^i | i = −1, 0, ..., 7, 8}, and we choose the p norm from the set {1, 8/7, 4/3, 2, 4, 8, 16, ∞}. According to Algorithm 1 presented in [17], the time complexity is O(n k l^2.3), where k is the number of iterations, since the kernel combination coefficients can be computed analytically.

The performance of the compared methods on the VOC and MIR datasets is reported in Table I. The values in the last column of Table I are average ranks. From the results, we first observe that the performance keeps improving with the increasing number of labeled samples. Second, the performance of the SimpleMKL algorithm, which learns the kernel weights for SVM, can be inferior to the multi-label algorithms with the mean kernel in many cases. MV3LSVM is superior to the multi-view (SimpleMKL and LpMKL) and multi-label algorithms in general and consistently outperforms the other methods in terms of mAP. The average rank of our algorithm is smaller than that of all the other methods in terms of all three criteria. According to the Friedman test [52], the statistics F_F of mAP, mAUC and RL are 56.05, 3.03, and 5.69 respectively. All of them are larger than the critical value F(6, 30) = 2.42, so we reject the null hypothesis (that the compared algorithms perform equally well). In particular, by comparing with SimpleMKL, we obtain a significant 8.1% mAP improvement on VOC when using 100 labeled samples. The level of improvement drops when more labeled samples are available, for the same reason described in our first set of experiments.
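The F_F values reported above can be computed from the per-setting rank columns of Table I using the Friedman statistic with the Iman-Davenport correction; with 7 methods and 6 dataset/label-budget settings this is compared against F(6, 30). The following sketch follows the standard procedure for such comparisons and is our reading rather than the authors' script.

```python
import numpy as np

def friedman_ff(ranks):
    """Friedman statistic with the Iman-Davenport correction (F_F).

    ranks: (N_settings, k_methods) array of per-setting ranks, as in Table I.
    """
    N, k = ranks.shape
    R = ranks.mean(axis=0)                                     # average rank of each method
    chi2 = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)               # compare against F(k-1, (k-1)(N-1))
```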

TABLE I PERFORMANCE EVALUATION ON THE TWO DATASETS

mAP ↑ vs. #{labeled samples}
Methods | VOC 100 | VOC 200 | VOC 500 | MIR 100 | MIR 200 | MIR 500 | Ranks
SVM CAT | 0.241±0.011 (7) | 0.288±0.013 (7) | 0.371±0.007 (7) | 0.281±0.009 (7) | 0.306±0.007 (7) | 0.352±0.008 (7) | 7
SVM UNI | 0.347±0.018 (4.5) | 0.424±0.014 (5) | 0.529±0.006 (5) | 0.302±0.011 (5) | 0.336±0.013 (6) | 0.400±0.009 (6) | 5.25
MLCS [18] | 0.332±0.017 (6) | 0.412±0.016 (6) | 0.525±0.007 (6) | 0.289±0.010 (6) | 0.342±0.011 (5) | 0.424±0.010 (5) | 5.67
KLS CCA [12] | 0.347±0.019 (4.5) | 0.432±0.014 (4) | 0.536±0.007 (4) | 0.321±0.009 (3.5) | 0.369±0.017 (2) | 0.445±0.009 (2.5) | 3.42
MV3LSVM | 0.412±0.025 (1) | 0.476±0.015 (1) | 0.555±0.006 (1) | 0.332±0.013 (1) | 0.376±0.017 (1) | 0.449±0.008 (1) | 1
SimpleMKL [16] | 0.381±0.024 (3) | 0.453±0.020 (3) | 0.538±0.011 (3) | 0.321±0.014 (3.5) | 0.365±0.017 (4) | 0.444±0.011 (2.5) | 3.17
LpMKL [17] | 0.391±0.024 (2) | 0.462±0.012 (2) | 0.540±0.006 (2) | 0.327±0.010 (2) | 0.367±0.014 (3) | 0.436±0.008 (4) | 2.5

mAUC ↑ vs. #{labeled samples}
Methods | VOC 100 | VOC 200 | VOC 500 | MIR 100 | MIR 200 | MIR 500 | Ranks
SVM CAT | 0.744±0.013 (7) | 0.785±0.006 (7) | 0.832±0.003 (7) | 0.722±0.008 (4) | 0.745±0.004 (6) | 0.783±0.004 (6) | 6.17
SVM UNI | 0.783±0.008 (3) | 0.824±0.009 (2.5) | 0.870±0.003 (2.5) | 0.718±0.011 (5) | 0.742±0.011 (7) | 0.782±0.006 (7) | 4.5
MLCS [18] | 0.773±0.010 (5) | 0.819±0.010 (6) | 0.869±0.004 (4) | 0.701±0.012 (7) | 0.749±0.010 (5) | 0.805±0.005 (1.5) | 4.42
KLS CCA [12] | 0.781±0.009 (4) | 0.824±0.008 (2.5) | 0.866±0.003 (5) | 0.737±0.009 (2) | 0.769±0.010 (1.5) | 0.805±0.005 (1.5) | 2.75
MV3LSVM | 0.801±0.011 (1) | 0.835±0.011 (1) | 0.875±0.004 (1) | 0.741±0.014 (1) | 0.769±0.012 (1.5) | 0.802±0.005 (3.5) | 1.5
SimpleMKL [16] | 0.769±0.017 (6) | 0.822±0.013 (4.5) | 0.870±0.006 (2.5) | 0.717±0.013 (6) | 0.753±0.010 (4) | 0.802±0.005 (3.5) | 4.42
LpMKL [17] | 0.786±0.008 (2) | 0.822±0.008 (4.5) | 0.862±0.005 (6) | 0.732±0.010 (3) | 0.756±0.010 (3) | 0.795±0.007 (5) | 3.92

RL ↓ vs. #{labeled samples}
Methods | VOC 100 | VOC 200 | VOC 500 | MIR 100 | MIR 200 | MIR 500 | Ranks
SVM CAT | 0.220±0.008 (7) | 0.183±0.006 (7) | 0.142±0.003 (7) | 0.165±0.005 (3) | 0.146±0.004 (5) | 0.126±0.002 (5) | 5.67
SVM UNI | 0.178±0.008 (2) | 0.143±0.006 (3) | 0.106±0.003 (1.5) | 0.549±0.040 (7) | 0.437±0.022 (7) | 0.177±0.011 (7) | 4.58
MLCS [18] | 0.195±0.007 (5) | 0.155±0.007 (5) | 0.112±0.004 (4) | 0.173±0.006 (5) | 0.145±0.005 (4) | 0.115±0.003 (2.5) | 4.25
KLS CCA [12] | 0.183±0.008 (3) | 0.149±0.006 (4) | 0.122±0.005 (5) | 0.168±0.013 (4) | 0.143±0.005 (3) | 0.121±0.004 (4) | 3.83
MV3LSVM | 0.170±0.007 (1) | 0.140±0.007 (1) | 0.108±0.003 (3) | 0.150±0.006 (1) | 0.130±0.007 (1) | 0.111±0.003 (1) | 1.33
SimpleMKL [16] | 0.214±0.017 (6) | 0.142±0.009 (2) | 0.106±0.004 (1.5) | 0.155±0.005 (2) | 0.136±0.006 (2) | 0.115±0.003 (2.5) | 2.67
LpMKL [17] | 0.186±0.011 (4) | 0.164±0.010 (6) | 0.137±0.007 (6) | 0.199±0.014 (6) | 0.181±0.007 (6) | 0.141±0.004 (6) | 5.67

(↑ indicates "the larger the better"; ↓ indicates "the smaller the better". Mean and std. are reported. The best result is highlighted in boldface.)

VI. CONCLUSION AND DISCUSSION

Most of the existing works on multi-label image classification use only a single feature representation, and the multiple-feature methods usually assume that a single label is assigned to an image. However, an image is usually associated with multiple labels, and different kinds of features are necessary to describe the image properly. Therefore, we have developed a multi-view vector-valued manifold regularization (MV3MR) framework for multi-label image classification in which images are naturally characterized by multiple views. MV3MR combines different kinds of features in the learning process of the vector-valued function for multi-label classification. We also derived an SVM formulation of MV3MR, which results in MV3LSVM. The new algorithm effectively exploits the label correlations and learns the view weights to integrate the consistency and complementary properties of different views. We evaluated the proposed algorithm in terms of three popular criteria, i.e. mAP, mAUC and RL. Intensive experiments on two challenging datasets, PASCAL VOC'07 and MIR Flickr, show that the support vector machine based implementation under MV3MR outperforms the traditional multi-label algorithms as well as a well-known multiple kernel learning method. Furthermore, our method provides a strategy for learning from multiple views in multi-label classification and can be extended to other multi-label algorithms.

APPENDIX A
PROOF OF LEMMA 1

Proof: The matrix M = Σ_v θ_v M_v = Σ_v θ_v (L_v ⊗ I_n) = L ⊗ I_n, where L = Σ_v θ_v L_v is defined as a convex combination of the scalar-valued graph Laplacians constructed from different views. L ∈ S^+_N since each L_v ∈ S^+_N, and thus we have M ∈ S^+_{Nn} according to the positive semi-definiteness property of the Kronecker product. Here, L = Σ_v θ_v (D_v − W_v) can be computed by using the following adjacency graph:

W_ij = Σ_v θ_v W_{vij}, if x_i ∈ N(x_j) or x_j ∈ N(x_i), and W_ij = 0 otherwise,

where N(x) denotes the set that contains the k nearest neighbors of x and W_{vij} is the similarity between the i-th and j-th points in the v-th view. Thus L is a graph Laplacian and M is the corresponding vector-valued graph Laplacian.

APPENDIX B
PROOF OF THE REPRESENTER THEOREM

Proof: It has been presented in Section IV-A that there is an RKHS H_K associated with the vector-valued kernel K. The probability distribution is assumed to be supported on a manifold M in the manifold regularization framework. We now denote S = {Σ_i K(x_i, ·) a_i | x_i ∈ M, a_i ∈ Y} as the linear space spanned by the kernels centered at the points on M. Any function f ∈ H_K can be decomposed as f = f_k + f_⊥, with f_k ∈ S and f_⊥ ∈ S^⊥. It has been proved in Lemma 1 that M is a graph Laplacian. Thus we can use M to induce an intrinsic norm ‖·‖_I, which satisfies ‖f‖_I = ‖g‖_I for any f, g ∈ H_K with (f − g)|_M ≡ 0. According to the reproducing property, it follows that f_⊥ vanishes on M [9]. This means that for any x ∈ M, we have f(x) = f_k(x), and then ‖f‖_I = ‖f_k‖_I. Besides, ‖f‖_K^2 = ‖f_k‖_K^2 + ‖f_⊥‖_K^2 ≥ ‖f_k‖_K^2, and thus we conclude that the minimizer of problem (6) must lie in S for fixed β and θ. Furthermore, because M is approximated by the Laplacian of the graph constructed by the labeled and unlabeled samples, we have S = {Σ_{i=1}^{l+u} K(x_i, ·) a_i | a_i ∈ Y}. This completes the proof.
APPENDIX B
PROOF OF THE REPRESENTER THEOREM

Proof: It has been shown in Section IV-A that there is an RKHS $\mathcal{H}_K$ associated with the vector-valued kernel $K$. In the manifold regularization framework, the probability distribution is assumed to be supported on a manifold $\mathcal{M}$. We now denote by $S = \{\sum_i K(x_i, \cdot)\, a_i \,|\, x_i \in \mathcal{M},\, a_i \in \mathcal{Y}\}$ the linear space spanned by the kernels centered at points on $\mathcal{M}$. Any function $f \in \mathcal{H}_K$ can be decomposed as $f = f_k + f_\perp$, with $f_k \in S$ and $f_\perp \in S^{\perp}$. It has been proved in Lemma 1 that $M$ is a graph Laplacian. Thus we can use $M$ to induce an intrinsic norm $\|\cdot\|_I$, which satisfies $\|f\|_I = \|g\|_I$ for any $f, g \in \mathcal{H}_K$ with $(f - g)|_{\mathcal{M}} \equiv 0$. According to the reproducing property, we conclude that $f_\perp$ vanishes on $\mathcal{M}$ [9]. This means that for any $x \in \mathcal{M}$ we have $f(x) = f_k(x)$, and hence $\|f\|_I = \|f_k\|_I$. Besides, $\|f\|_K^2 = \|f_k\|_K^2 + \|f_\perp\|_K^2 \geq \|f_k\|_K^2$, and thus the minimizer of problem (6) must lie in $S$ for fixed $\beta$ and $\theta$. Furthermore, because $\mathcal{M}$ is approximated by the Laplacian of the graph constructed from the labeled and unlabeled samples, we have $S = \{\sum_{i=1}^{l+u} K(x_i, \cdot)\, a_i \,|\, a_i \in \mathcal{Y}\}$. This completes the proof.
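Spelled out, the consequence of the theorem (writing the minimizer as $f^{*}$, a symbol introduced here only for illustration) is that the solution of problem (6) admits the finite expansion

$$
f^{*}(\cdot) = \sum_{i=1}^{l+u} K(x_i, \cdot)\, a_i, \qquad a_i \in \mathcal{Y},
$$

so training reduces to estimating the $l+u$ coefficient vectors $a_i$ over the labeled and unlabeled samples rather than searching over the whole RKHS $\mathcal{H}_K$.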
REFERENCES

[1] M. Boutell, J. Luo, X. Shen, and C. Brown, "Learning multi-label scene classification," Pattern Recognition, vol. 37, no. 9, pp. 1757–1771, 2004.
[2] M. Guillaumin, J. Verbeek, and C. Schmid, "Multimodal semi-supervised learning for image classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 902–909.
[3] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[4] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[5] E. Çesmeli and D. Wang, "Texture segmentation using Gaussian-Markov random fields and neural oscillator networks," IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 394–404, 2001.
[6] D. Masip and J. Vitrià, "Shared feature extraction for nearest neighbor face recognition," IEEE Transactions on Neural Networks, vol. 19, no. 4, pp. 586–595, 2008.
[7] C. Micchelli and M. Pontil, "On learning vector-valued functions," Neural Computation, vol. 17, no. 1, pp. 177–204, 2005.
[8] H. Minh and V. Sindhwani, "Vector-valued manifold regularization," in Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 57–64.
[9] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[10] G. Chen, Y. Song, F. Wang, and C. Zhang, "Semi-supervised multi-label learning by solving a Sylvester equation," in SIAM International Conference on Data Mining, 2008, pp. 410–419.
[11] B. Hariharan, L. Zelnik-Manor, S. Vishwanathan, and M. Varma, "Large scale max-margin multi-label classification with priors," in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 423–430.
[12] L. Sun, S. Ji, and J. Ye, "Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 194–200, 2011.
[13] L. Rosasco, E. Vito, A. Caponnetto, M. Piana, and A. Verri, "Are loss functions all the same?" Neural Computation, vol. 16, no. 5, pp. 1063–1076, 2004.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results."
[15] M. J. Huiskes and M. S. Lew, "The MIR Flickr retrieval evaluation," in Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
[16] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," Journal of Machine Learning Research, vol. 9, pp. 2491–2521, 2008.
[17] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, "Lp-norm multiple kernel learning," The Journal of Machine Learning Research, vol. 12, pp. 953–997, 2011.
[18] D. Hsu, S. Kakade, J. Langford, and T. Zhang, "Multi-label prediction via compressed sensing," in Advances in Neural Information Processing Systems, 2009, pp. 772–780.
[19] R. Schapire and Y. Singer, "BoosTexter: A boosting-based system for text categorization," Machine Learning, vol. 39, no. 2, pp. 135–168, 2000.
[20] A. Elisseeff and J. Weston, "A kernel method for multi-labelled classification," in Advances in Neural Information Processing Systems, 2001, pp. 681–687.
[21] M. Zhang and Z. Zhou, "ML-KNN: A lazy learning approach to multi-label learning," Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 2007.
[22] A. McCallum, "Multi-label text classification with a mixture model trained by EM," in AAAI Workshop on Text Learning, 1999.
[23] L. Sun, S. Ji, and J. Ye, "Hypergraph spectral learning for multi-label classification," in Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining, 2008, pp. 668–676.
[24] S. Ji, L. Tang, S. Yu, and J. Ye, "A shared-subspace learning framework for multi-label classification," ACM Transactions on Knowledge Discovery from Data, vol. 4, no. 2, pp. 1–29, 2010.
[25] S. Zafeiriou, G. Tzimiropoulos, M. Petrou, and T. Stathaki, "Regularized kernel discriminant analysis with a robust kernel for face recognition and verification," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 526–534, 2012.
[26] G. Lanckriet, N. Cristianini, P. Bartlett, L. Ghaoui, and M. Jordan, "Learning the kernel matrix with semidefinite programming," in Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 323–330.
[27] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, "Large scale multiple kernel learning," The Journal of Machine Learning Research, vol. 7, pp. 1531–1565, 2006.
[28] K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T. Lee, and T. Sejnowski, "Dictionary learning algorithms for sparse representation," Neural Computation, vol. 15, no. 2, pp. 349–396, 2003.
[29] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169–2178.
[30] K. Labusch, E. Barth, and T. Martinetz, "Simple method for high-performance digit recognition based on sparse coding," IEEE Transactions on Neural Networks, vol. 19, no. 11, pp. 1985–1989, 2008.
[31] X. Zhou, K. Yu, T. Zhang, and T. Huang, "Image classification using super-vector coding of local image descriptors," in Proceedings of the 11th European Conference on Computer Vision, 2010, pp. 141–154.
[32] Y. Yang and S. Newsam, "Spatial pyramid co-occurrence for image classification," in IEEE International Conference on Computer Vision, 2011, pp. 1465–1472.
[33] H. Su, M. Sun, L. Fei-Fei, and S. Savarese, "Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories," in IEEE International Conference on Computer Vision, 2009, pp. 213–220.
[34] Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z. Zhou, "Multi-view video summarization," IEEE Transactions on Multimedia, vol. 12, no. 7, pp. 717–729, 2010.
[35] A. Iosifidis, A. Tefas, and I. Pitas, "View-invariant action recognition based on artificial neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 3, pp. 412–424, 2012.
[36] A. Saffari, C. Leistner, M. Godec, and H. Bischof, "Robust multi-view boosting with priors," in Proceedings of the 11th European Conference on Computer Vision, 2010, pp. 776–789.
[37] D. Zhang, F. Wang, L. Si, and T. Li, "Maximum margin multiple instance clustering with its applications to image and text clustering," IEEE Transactions on Neural Networks, vol. 22, no. 5, pp. 739–751, 2011.
[38] H. Zhang, G. Liu, T. Chow, and W. Liu, "Textual and visual content-based anti-phishing: A Bayesian approach," IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1532–1546, 2011.
[39] S. Bucak, R. Jin, and A. Jain, "Multi-label learning with incomplete class assignments," in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2801–2808.
[40] X. Chen, X. Yuan, Q. Chen, S. Yan, and T. Chua, "Multi-label visual classification with label exclusive context," in IEEE International Conference on Computer Vision, 2011, pp. 834–841.
[41] Z. Fan, Y. Xu, and D. Zhang, "Local linear discriminant analysis framework using sample neighbors," IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1119–1132, 2011.
[42] D. Bouchaffra, "Mapping dynamic Bayesian networks to α-shapes: Application to human faces identification across ages," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 8, pp. 1229–1241, 2012.

[43] L. Chen, I. Tsang, and D. Xu, "Laplacian embedded regression for scalable manifold regularization," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 6, pp. 902–915, 2012.
[44] V. Strassen, "Gaussian elimination is not optimal," Numerische Mathematik, vol. 13, no. 4, pp. 354–356, 1969.
[45] J. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods, 1999, pp. 185–208.
[46] J. Van De Weijer and C. Schmid, "Coloring local feature extraction," in European Conference on Computer Vision, 2006, pp. 334–348.
[47] M. Zhu, "Recall, precision and average precision," University of Waterloo, Tech. Rep., 2004.
[48] T. Fawcett, "ROC graphs: Notes and practical considerations for researchers," Machine Learning, vol. 31, pp. 1–38, 2004.
[49] X. Zhu, "Semi-supervised learning literature survey," University of Wisconsin–Madison, Tech. Rep., 2006.
[50] M. Gori, "On the time complexity of regularized least square," in WIRN, 2011, pp. 85–96.
[51] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[52] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," The Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.