Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Self-weighted Multiple Kernel Learning for Graph-based Clustering and Semi-supervised Classification

Zhao Kang1∗, Xiao Lu1, Jinfeng Yi2, Zenglin Xu1∗
1 SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Sichuan 611731, China
2 JD AI Research, Beijing 100101, China
[email protected], [email protected], [email protected], [email protected]
∗Corresponding author.

Abstract

Multiple kernel learning (MKL) is generally believed to perform better than single kernel methods. However, some empirical studies show that this is not always true: the combination of multiple kernels may yield an even worse performance than using a single kernel. There are two possible reasons for the failure: (i) most existing MKL methods assume that the optimal kernel is a linear combination of base kernels, which may not hold true; and (ii) some kernel weights are inappropriately assigned due to noise and carelessly designed algorithms. In this paper, we propose a novel MKL framework by following two intuitive assumptions: (i) each kernel is a perturbation of the consensus kernel; and (ii) the kernel that is close to the consensus kernel should be assigned a large weight. Impressively, the proposed method can automatically assign an appropriate weight to each kernel without introducing additional parameters as existing methods do. The proposed method is integrated into a unified framework for graph-based clustering and semi-supervised classification. We have conducted experiments on multiple benchmark datasets, and our empirical results verify the superiority of the proposed framework.

1 Introduction

As a principled way of introducing non-linearity into linear models, kernel methods have been widely applied in many tasks [Hofmann et al., 2008; Xu et al., 2010]. Although improved performance has been reported in a wide variety of problems, kernel methods require the user to select and tune a single pre-defined kernel. This is not user-friendly, since the most suitable kernel for a specific task is usually hard to decide. Moreover, it is time-consuming and impractical to exhaustively search a large pool of candidate kernels. Multiple kernel learning (MKL) was proposed to address this issue, as it offers an automatic way of learning an optimal combination of distinct base kernels [Xu et al., 2009]. Generally speaking, an MKL method should yield a better performance than a single kernel approach.

A key step in MKL is to assign a reasonable weight to each kernel according to its importance. One popular approach considers a weighted combination of candidate kernel matrices, leading to a convex quadratically constrained quadratic program. However, this method over-reduces the feasible set of the optimal kernel, which may lead to a less representative solution. In fact, these MKL algorithms sometimes fail to outperform single kernel methods or traditional non-weighted kernel approaches [Yu et al., 2010; Gehler and Nowozin, 2009]. Another issue is inappropriate weight assignment. Some attempts aim to learn the local importance of features by assuming that samples may vary locally [Gönen and Alpaydin, 2008]. However, they induce more complex computational problems.

To address these issues, in this paper we model the differences among kernels by following two intuitive assumptions: (i) each kernel is a perturbation of the consensus kernel; and (ii) the kernel that is close to the consensus kernel should receive a large weight. As a result, instead of enforcing the optimal kernel to be a linear combination of predefined kernels, this approach allows the most suitable kernel to reside in the neighborhood of some kernels. Moreover, our proposed method can assign an optimal weight to each kernel automatically, without introducing an additional parameter as existing methods do. We then combine this novel weighting scheme with graph-based clustering and semi-supervised learning (SSL). Due to their effectiveness in similarity graph construction, graph-based clustering and SSL have shown impressive performance [Nie et al., 2017a; Kang et al., 2017b]. Finally, a novel multiple kernel learning framework for clustering and semi-supervised learning is developed.

In summary, our main contributions are two-fold:

• We propose a novel way to construct the optimal kernel and assign weights to base kernels. Notably, our proposed method can find a better kernel in the neighborhood of the candidate kernels. The weight is a function of the kernel matrix, so we do not need to introduce an additional parameter as existing methods do. This also eases the burden of solving a constrained quadratic program.


• A unified framework for clustering and SSL is developed. It seamlessly integrates the components of graph construction, label learning, and kernel learning by incorporating the graph structure constraint. This allows them to negotiate with each other to achieve overall optimality. Our experiments on multiple real-world datasets verify the effectiveness of the proposed framework.

2 Related Work

In this section, we divide the related work into two categories, namely graph-based clustering and SSL, and parameter-weighted multiple kernel learning.

2.1 Graph-based Clustering and SSL

Graph-based clustering [Ng et al., 2002; Yang et al., 2017] and SSL [Zhu et al., 2003] have been popular for their simplicity and impressive performance. The graph matrix used to measure the similarity of data points is crucial to their performance, and there is no fully satisfying solution to this problem. Recently, automatically learning a graph from data has achieved promising results. One approach is based on the adaptive neighbor idea, i.e., s_ij is learned as a measure of the probability that X_i is a neighbor of X_j. Then S is treated as the graph input for clustering [Nie et al., 2014; Huang et al., 2018b] and SSL [Nie et al., 2017a]. Another approach uses the so-called self-expressiveness property, i.e., each data point is expressed as a weighted combination of the other points, and this learned weight matrix behaves like the graph matrix. Representative work in this category includes [Huang et al., 2015; Li et al., 2015; Kang et al., 2018]. These methods are all developed in the original feature space. To make the model more general, we develop it in kernel space in this paper. Our purpose is to learn a graph with exactly c connected components if the data contains c clusters or classes. In this work, we consider this condition explicitly.

2.2 Parameter-weighted Multiple Kernel Learning

It is well-known that the performance of a kernel method crucially depends on the kernel function, as it intrinsically specifies the feature space. MKL is an efficient way for automatic kernel selection and for embedding different notions of similarity [Kang et al., 2017a]. It is generally formulated as follows:

\min_{\theta} f(K) \quad \text{s.t.}\quad K = \sum_i \theta_i H^i,\;\; \sum_i (\theta_i)^q = 1,\;\; \theta_i \ge 0,   (1)

where f is the objective function, K is the consensus kernel, H^i is an artificially constructed kernel, θ_i represents the weight for kernel H^i, and q > 0 is used to smoothen the weight distribution. Therefore, we frequently solve for θ and tune q. Though this approach is widely used, it still suffers from the following problems. First, the linear combination of base kernels over-reduces the feasible set of optimal kernels, which could result in a learned kernel with limited representation ability. Second, the optimization of kernel weights may lead to inappropriate assignments due to noise and carelessly designed algorithms. Indeed, contrary to the original intention of MKL, this approach sometimes obtains lower accuracy than that of using equally weighted kernels or merely a single kernel. This will hinder the practical use of MKL. This phenomenon has been observed for many years [Cortes, 2009] but rarely studied. Thus, it is vital to develop some new approaches.

3 Methodology

3.1 Notations

Throughout the paper, all matrices are written in uppercase. For a matrix X, its ij-th element and j-th column are denoted by x_ij and X_j, respectively. The trace of X is denoted by Tr(X). The i-th kernel of X is written as H^i. The p-norm of a vector x is represented by ||x||_p. The Frobenius norm of matrix X is denoted by ||X||_F. I is an identity matrix of proper size. S ≥ 0 means all entries of S are nonnegative.

3.2 Self-weighted Multiple Kernel Learning

The aforementioned self-expressiveness based graph learning method can be formulated as:

\min_S \|X - XS\|_F^2 + \gamma\|S\|_F^2 \quad \text{s.t.}\; S \ge 0,   (2)

where γ is the trade-off parameter. To exploit the power of kernel methods, we extend Eq. (2) to its kernel version by using a kernel mapping φ. According to the kernel trick K(x, z) = φ(x)^T φ(z), we have

\min_S \|\phi(X) - \phi(X)S\|_F^2 + \gamma\|S\|_F^2 \;\Longleftrightarrow\; \min_S \mathrm{Tr}(K - 2KS + S^T K S) + \gamma\|S\|_F^2 \quad \text{s.t.}\; S \ge 0.   (3)

Ideally, we hope to obtain a graph with exactly c connected components if the data contains c clusters or classes. In other words, the graph S is block diagonal under proper permutations. It is straightforward to check that S in Eq. (3) can hardly satisfy such a constraint.

If the similarity graph matrix S is nonnegative, then the Laplacian matrix L = D − S, where D is the diagonal degree matrix defined as d_ii = Σ_j s_ij, associated with S has the following important property [Mohar et al., 1991]:

Theorem 1 The multiplicity c of the eigenvalue 0 of the Laplacian matrix L is equal to the number of connected components in the graph associated with S.

Theorem 1 indicates that if rank(L) = n − c, then the constraint on S will hold. Therefore, problem (3) can be rewritten as:

\min_S \mathrm{Tr}(K - 2KS + S^T K S) + \gamma\|S\|_F^2 \quad \text{s.t.}\; S \ge 0,\; \mathrm{rank}(L) = n - c.   (4)

Suppose σ_i(L) is the i-th smallest eigenvalue of L. Note that σ_i(L) ≥ 0 because L is positive semi-definite. Problem (4) is equivalent to the following problem for a large enough α:

\min_S \mathrm{Tr}(K - 2KS + S^T K S) + \gamma\|S\|_F^2 + \alpha\sum_{i=1}^{c}\sigma_i(L) \quad \text{s.t.}\; S \ge 0.   (5)
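To make Theorem 1 concrete, the following small numerical check (our own illustration, not part of the original method; it only assumes numpy is available) builds a 4-node graph with two obvious blocks and counts the zero eigenvalues of its Laplacian:

```python
import numpy as np

# Toy check of Theorem 1: for a nonnegative similarity matrix S whose graph has
# c connected components, L = D - S has the eigenvalue 0 with multiplicity c.
# Hypothetical 4-node graph with two components {0, 1} and {2, 3}.
S = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(S.sum(axis=1)) - S          # Laplacian L = D - S
eigvals = np.linalg.eigvalsh(L)         # L is symmetric positive semi-definite
c = int(np.sum(eigvals < 1e-10))        # multiplicity of the eigenvalue 0
print(c)                                # prints 2, i.e. rank(L) = n - c with n = 4
```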


According to Ky Fan's Theorem [Fan, 1949], we have:

\sum_{i=1}^{c}\sigma_i(L) = \min_{P\in\mathbb{R}^{n\times c},\, P^T P = I} \mathrm{Tr}(P^T L P).   (6)

P can be a cluster indicator matrix or a label matrix. Therefore, problem (5) is further equivalent to the following problem:

\min_{S,P} \mathrm{Tr}(K - 2KS + S^T K S) + \gamma\|S\|_F^2 + \alpha\,\mathrm{Tr}(P^T L P) \quad \text{s.t.}\; P^T P = I,\; S \ge 0.   (7)

Problem (7) is much easier to solve than the rank-constrained problem (4). We name this model Kernel-based Graph Learning (KGL). Note that this model's input is the kernel matrix K. It is generally recognized that its performance is largely determined by the choice of kernel. Unfortunately, the most suitable kernel for a particular task is often unknown in advance. Although MKL as in Eq. (1) can be applied to resolve this issue, it is still not satisfying, as discussed in subsection 2.2.

In this work, we design a novel MKL strategy. It is based on the following two intuitive assumptions: 1) each kernel is a perturbation of the consensus kernel, and 2) the kernel that is close to the consensus kernel should receive a large weight. Motivated by these, we have the following MKL form:

\min_K \sum_i w_i\|H^i - K\|_F^2   (8)

and

w_i = \frac{1}{2\|H^i - K\|_F}.   (9)

We can see that w_i depends on the target variable K, so it is not directly available. But w_i can be set to be stationary, i.e., after obtaining K, we update w_i correspondingly [Nie et al., 2017b]. Instead of enforcing the optimal kernel to be a linear combination of candidate kernels as in Eq. (1), Eq. (8) allows the most suitable kernel to reside in the neighborhood of some kernels [Liu et al., 2009]. This enhances the representation ability of the learned optimal kernel [Liu et al., 2017; 2013]. Furthermore, we do not introduce an additional hyper-parameter θ, which often leads to a quadratic program. The optimal weight w_i for each kernel H^i is directly calculated from the kernel matrices. Then our Self-weighted Multiple Kernel Learning (SMKL) framework can be formulated as:

\min_{S,P,K} \mathrm{Tr}(K - 2KS + S^T K S) + \gamma\|S\|_F^2 + \alpha\,\mathrm{Tr}(P^T L P) + \beta\sum_{i=1}^{r} w_i\|H^i - K\|_F^2 \quad \text{s.t.}\; P^T P = I,\; S \ge 0.   (10)

This model enjoys the following properties:

1. This unified framework sufficiently considers the negotiation between the process of learning the optimal kernel and that of graph/label learning. By iteratively updating S, P, and K, they can be repeatedly improved.

2. By treating the optimal kernel as a perturbation of the base kernels, it effectively enlarges the region from which an optimal kernel can be chosen, and is therefore in a better position than traditional approaches to identify a more suitable kernel.

3. The kernel weight is directly calculated from the kernel matrices. Therefore, we avoid solving a quadratic program.

To see the effect of our proposed MKL method, we compare it against a traditional parameterized kernel learning approach. For convenience, we denote it as Parameterized MKL (PMKL). It can be written as:

\min_{S,P,\theta} \mathrm{Tr}(K - 2KS + S^T K S) + \gamma\|S\|_F^2 + \alpha\,\mathrm{Tr}(P^T L P) \quad \text{s.t.}\; K = \sum_{i=1}^{r}\theta_i H^i,\; \sum_{i=1}^{r}\theta_i^p = 1,\; \theta_i \ge 0,\; P^T P = I,\; S \ge 0.   (11)

3.3 Optimization

We divide the problem in Eq. (10) into three subproblems and develop an alternating, iterative algorithm to solve them.

For S, we fix P and K. The problem in Eq. (10) becomes:

\min_S \mathrm{Tr}(-2KS + S^T K S) + \gamma\|S\|_F^2 + \alpha\,\mathrm{Tr}(P^T L P) \quad \text{s.t.}\; S \ge 0.   (12)

Based on \sum_{ij}\tfrac{1}{2}\|P_{i,:} - P_{j,:}\|_2^2\, s_{ij} = \mathrm{Tr}(P^T L P), we can equivalently solve the following problem for each sample:

\min_{S_i} \; -2K_{i,:}S_i + S_i^T K S_i + \gamma S_i^T S_i + \frac{\alpha}{2} G_i^T S_i,   (13)

where G_i \in \mathbb{R}^{n\times 1} with g_{ij} = \|P_{i,:} - P_{j,:}\|_2^2. By setting the first derivative with respect to S_i to zero, we obtain:

S_i = (\gamma I + K)^{-1}\Big(K_{i,:} - \frac{\alpha G_i}{4}\Big).   (14)

Thus S can be obtained column by column in parallel.

For K, we fix S and P. The problem in Eq. (10) becomes:

\min_K \mathrm{Tr}(K - 2KS + S^T K S) + \beta\sum_i w_i\|H^i - K\|_F^2.   (15)

Similar to (14), it yields:

K = \frac{2S - SS^T - I + 2\beta\sum_i w_i H^i}{2\beta\sum_i w_i}.   (16)

From Eq. (16) and Eq. (14), we can observe that S and K are seamlessly coupled; hence they are allowed to negotiate with each other to achieve better results.

For P, we fix S and K. The problem in Eq. (10) becomes:

\min_P \mathrm{Tr}(P^T L P) \quad \text{s.t.}\; P^T P = I.   (17)

The optimal solution P consists of the c eigenvectors of L corresponding to the c smallest eigenvalues.
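For readers who prefer code, the three closed-form updates above can be sketched roughly as follows. This is only an illustrative numpy sketch under our reading of Eqs. (9), (14), (16), and (17); the function names are ours, the nonnegativity of S is enforced by a simple projection, and K is symmetrized for numerical safety, neither of which is spelled out in the paper.

```python
import numpy as np

def update_S(K, P, alpha, gamma):
    """Eq. (14): closed-form update of the graph S, all columns solved at once."""
    n = K.shape[0]
    # G[j, i] = ||P_i - P_j||_2^2 (pairwise squared distances between rows of P)
    G = np.square(P[:, None, :] - P[None, :, :]).sum(axis=2)
    S = np.linalg.solve(gamma * np.eye(n) + K, K - alpha * G / 4)
    return np.maximum(S, 0)              # simple projection onto S >= 0 (our choice)

def update_K(S, H_list, w, beta):
    """Eq. (16): closed-form consensus kernel."""
    n = S.shape[0]
    num = 2 * S - S @ S.T - np.eye(n) + 2 * beta * sum(wi * Hi for wi, Hi in zip(w, H_list))
    K = num / (2 * beta * np.sum(w))
    return (K + K.T) / 2                 # symmetrize (our choice)

def update_w(H_list, K):
    """Eq. (9): self-determined kernel weights."""
    return np.array([1.0 / (2 * np.linalg.norm(Hi - K, 'fro')) for Hi in H_list])

def update_P(S, c):
    """Eq. (17): c eigenvectors of L = D - S for the c smallest eigenvalues."""
    W = (S + S.T) / 2
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :c]
```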

3.4 Extension to Semi-supervised Classification

Model (10) also lends itself to semi-supervised classification. Graph construction and label inference are two fundamental stages in SSL. Solving the two separate problems only once is suboptimal, since the label information is not exploited when learning the graph. SMKL unifies these two fundamental components into a single framework. Then the given labels and the estimated labels are utilized to build the graph and to predict the unknown labels. Following a similar approach, we can reformulate SMKL for semi-supervised classification as:

\min_{S,P,K} \mathrm{Tr}(K - 2KS + S^T K S) + \gamma\|S\|_F^2 + \alpha\,\mathrm{Tr}(P^T L P) + \beta\sum_i w_i\|H^i - K\|_F^2 \quad \text{s.t.}\; S \ge 0,\; P_l = Y_l,   (18)

where Y_l = [y_1, \cdots, y_l]^T denotes the label matrix and l is the number of labeled points. y_i \in \mathbb{R}^{c\times 1} is one-hot, and y_ij = 1 indicates that the i-th sample belongs to the j-th class. P = [P_l; P_u] = [Y_l; P_u], where the u unlabeled points are placed at the back. Problem (18) can be solved by the same procedure as (10); the difference lies in updating P.

To solve for P, we take the derivative of (18) with respect to P and obtain LP = 0, i.e.,

\begin{bmatrix} L_{ll} & L_{lu} \\ L_{ul} & L_{uu} \end{bmatrix}\begin{bmatrix} Y_l \\ P_u \end{bmatrix} = 0.

It yields:

P_u = -L_{uu}^{-1} L_{ul} Y_l.   (19)

Finally, the class labels of the unlabeled points can be assigned according to the following decision rule:

y_i = \arg\max_j P_{ij}.   (20)

ALGORITHM 1: The Proposed Framework SMKL
Input: Kernel matrices {H^i}_{i=1}^r, parameters α, β, γ.
Initialize: Random matrix S, K = Σ_i H^i / r.
REPEAT
1: For each i, update the i-th column of S according to (14).
2: Calculate K by (16).
3: Update w by (9).
4: For clustering, calculate P as the eigenvectors of L = D − S corresponding to the c smallest eigenvalues. For SSL, calculate P according to (19).
UNTIL stopping criterion is met.
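Assuming the update_* helpers sketched in Section 3.3 above, Algorithm 1 might be driven by a loop like the following. Again, this is only an illustrative sketch with a fixed iteration budget; the stopping criterion and initialization details are our own assumptions.

```python
import numpy as np

def smkl(H_list, c, alpha, beta, gamma, n_iter=30, seed=0):
    """Rough driver for Algorithm 1 (clustering case), reusing update_S/K/w/P."""
    rng = np.random.default_rng(seed)
    n = H_list[0].shape[0]
    S = np.abs(rng.standard_normal((n, n)))   # random nonnegative initialization
    K = sum(H_list) / len(H_list)             # K = sum_i H^i / r
    P = update_P(S, c)
    w = update_w(H_list, K)
    for _ in range(n_iter):                   # fixed budget in place of the
        S = update_S(K, P, alpha, gamma)      # unspecified stopping criterion
        K = update_K(S, H_list, w, beta)      # step 2, Eq. (16)
        w = update_w(H_list, K)               # step 3, Eq. (9)
        P = update_P(S, c)                    # step 4, Eq. (17); for SSL use Eq. (19)
    return S, K, P
```

In the clustering experiments one would still need to turn P into discrete labels, e.g., by running k-means on its rows; the paper does not spell this step out, so that choice is our own.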

4 Clustering

In this section, we conduct clustering experiments to demonstrate the efficacy of our method.

4.1 Data Sets

We conduct experiments on six publicly available data sets, summarized in Table 1. Specifically, the first two data sets, YALE and JAFFE, consist of face images. YEAST is a microarray data set. Tr11, Tr41, and Tr45 are derived from the NIST TREC Document Database.

Table 1: Statistics of the data sets

         # instances   # features   # classes
YALE     165           1024         15
JAFFE    213           676          10
YEAST    1484          1470         10
TR11     414           6429         9
TR41     878           7454         10
TR45     690           8261         10

We design 12 kernels. They are: seven Gaussian kernels of the form K(x, y) = exp(−||x − y||²/(t d_max²)), where d_max is the maximal distance between samples and t varies over the set {0.01, 0.05, 0.1, 1, 10, 50, 100}; a linear kernel K(x, y) = x^T y; and four polynomial kernels K(x, y) = (a + x^T y)^b with a ∈ {0, 1} and b ∈ {2, 4}. Besides, all kernels are rescaled to [0, 1] by dividing each element by the largest pairwise squared distance.
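As an illustration only, the 12 candidate kernels described above could be generated along the following lines. This is a numpy sketch under our reading of the construction; in particular, the final rescaling step is one possible interpretation of the normalization sentence, not a statement of the authors' exact procedure.

```python
import numpy as np

def build_candidate_kernels(X):
    """Sketch of the 12 candidate kernels for an n-by-d data matrix X."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)  # squared distances
    d2_max = D2.max()                                 # squared maximal distance
    lin = X @ X.T
    kernels = []
    for t in [0.01, 0.05, 0.1, 1, 10, 50, 100]:       # seven Gaussian kernels
        kernels.append(np.exp(-D2 / (t * d2_max)))
    kernels.append(lin)                               # one linear kernel
    for a in [0, 1]:                                  # four polynomial kernels
        for b in [2, 4]:
            kernels.append((a + lin) ** b)
    # one reading of "rescaled to [0, 1]": divide each kernel by its largest entry
    return [K / np.abs(K).max() for K in kernels]
```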

4.2 Comparison Methods

We compare with a number of single kernel and multiple kernel learning based clustering methods. They include: Spectral Clustering (SC) [Ng et al., 2002], Simplex Sparse Representation (SSR) [Huang et al., 2015], Multiple Kernel k-means (MKKM) [Huang et al., 2012b], Affinity Aggregation for Spectral Clustering (AASC) [Huang et al., 2012a], and Robust Multiple Kernel k-means (RMKKM)¹ [Du et al., 2015]. Among them, SC, SSR, and AASC are graph-based methods, while the others are k-means variants. Although much recent work has focused on multi-view [Huang et al., 2018a] and deep learning based clustering [Peng et al., 2016], they are not our focus in this paper.

4.3 Performance Evaluation

SMKL² is compared with the other techniques. We show the clustering results in terms of accuracy (Acc) and NMI in Table 2. For SC and KGL, we report the best performance achieved over the 12 kernels. It can clearly be seen that SMKL achieves the best performance in most cases. Compared to PMKL, SMKL works better. As shown in Eqs. (10) and (11), this is attributed to our new MKL scheme. Note that KGL outperforms PMKL in several experiments. This is consistent with previous work's claim that MKL may degrade the performance. However, our proposed SMKL beats KGL in all cases. This also demonstrates the effectiveness of our MKL strategy. With respect to the recently developed methods SSR and RMKKM, we also observe considerable improvement. Remember that SSR is based on self-expressiveness in the original space. Compared to the traditional methods SC, MKKM, and AASC, our advantages become even more obvious.

¹https://github.com/csliangdu/RMKKM
²https://github.com/sckangz/IJCAI2018

Data    Metric   SC       SSR      MKKM     RMKKM    AASC     KGL      PMKL     SMKL
YALE    Acc      0.4942   0.5455   0.4570   0.5218   0.4064   0.5549   0.5605   0.6000
        NMI      0.5292   0.5726   0.5006   0.5558   0.4683   0.5498   0.5643   0.6029
JAFFE   Acc      0.7488   0.8732   0.7455   0.8707   0.3035   0.9877   0.9802   0.9906
        NMI      0.8208   0.9293   0.7979   0.8937   0.2722   0.9825   0.9806   0.9834
YEAST   Acc      0.3555   0.2999   0.1304   0.3163   0.3538   0.3892   0.3952   0.4326
        NMI      0.2138   0.1585   0.1029   0.2071   0.2119   0.2315   0.2361   0.2652
TR11    Acc      0.5098   0.4106   0.5013   0.5771   0.4715   0.7425   0.7485   0.8309
        NMI      0.4311   0.2760   0.4456   0.5608   0.3939   0.6000   0.6137   0.7167
TR41    Acc      0.6352   0.6378   0.5610   0.6265   0.4590   0.6942   0.6724   0.7631
        NMI      0.6133   0.5956   0.5775   0.6347   0.4305   0.6008   0.6500   0.6148
TR45    Acc      0.5739   0.7145   0.5846   0.6400   0.5264   0.7425   0.7468   0.7536
        NMI      0.4803   0.6782   0.5617   0.6273   0.4190   0.6824   0.7523   0.6965

Table 2: Clustering results on benchmark data sets. The best results are in bold font.

4.4 Parameter Analysis

There are three parameters in our model (10). Figure 1 shows the clustering accuracy on the YALE data set with varying α, β, and γ. We can observe that the performance is not very sensitive to these parameters. This conclusion also holds for NMI.

[Figure 1: The influence of parameters on the accuracy of the YALE data set. Two surface plots of Acc over α and γ: (a) β = 10; (b) β = 100.]

5 Semi-supervised Classification

In this section, we assess the effectiveness of SMKL on the semi-supervised classification task.

5.1 Data Sets

1) Evaluation on Face Recognition: We examine the effectiveness of our graph learning for face recognition on two frequently used face databases: YALE and JAFFE. The YALE face data set contains 15 individuals, and each person has 11 near-frontal images taken under different illuminations. Each image is resized to 32×32 pixels. The JAFFE face database consists of 10 individuals, and each subject has 7 different facial expressions (6 basic facial expressions plus 1 neutral). The images are resized to 26×26 pixels.

2) Evaluation on Digit/Letter Recognition: In this experiment, we address the digit/letter recognition problem on the BA database. The data set consists of digits "0" through "9" and capital letters "A" to "Z". Therefore, there are 36 classes, and each class has 39 samples.

3) Evaluation on Visual Object Recognition: We conduct a visual object recognition experiment on the COIL20 database. The database consists of 20 objects and 72 images for each object. For each object, the images were taken 5 degrees apart as the object rotates on a turntable. The size of each image is 32×32 pixels.

Similar to the clustering experiments, we construct 7 kernels for each data set. They include: four Gaussian kernels with t varying over {0.1, 1, 10, 100}; a linear kernel K(x, y) = x^T y; and two polynomial kernels K(x, y) = (a + x^T y)² with a ∈ {0, 1}.
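For concreteness, the label-inference step of Eqs. (19) and (20), on which these SSL experiments rely once the graph has been learned, can be sketched as follows (an illustrative numpy sketch; the interface and symmetrization are our own choices):

```python
import numpy as np

def infer_labels(S, Y_l, labeled_idx, unlabeled_idx):
    """Eqs. (19)-(20): harmonic solution on the learned graph S.

    Y_l is the l-by-c one-hot label matrix of the labeled points; returns the
    predicted class index for every unlabeled point."""
    W = (S + S.T) / 2                              # symmetrize the learned graph
    L = np.diag(W.sum(axis=1)) - W                 # Laplacian L = D - W
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    L_ul = L[np.ix_(unlabeled_idx, labeled_idx)]
    P_u = -np.linalg.solve(L_uu, L_ul @ Y_l)       # Eq. (19): P_u = -L_uu^{-1} L_ul Y_l
    return np.argmax(P_u, axis=1)                  # Eq. (20): y_i = argmax_j P_ij
```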


Data     Labeled (%)   GFHF          LGC           S3R           S2LRR         SCAN          SMKL
YALE     10            38.01±1.91    47.33±13.96   38.83±8.60    28.77±9.59    45.07±1.30    55.87±12.26
         30            54.13±9.47    63.08±2.20    58.25±4.25    42.58±5.93    60.92±4.03    74.08±1.92
         50            60.28±5.16    69.56±5.42    69.00±6.57    51.22±6.78    68.94±4.57    82.44±3.61
JAFFE    10            92.85±7.76    96.68±2.76    97.33±1.51    94.38±6.23    96.92±1.68    97.57±1.55
         30            98.50±1.01    98.86±1.14    99.25±0.81    98.82±1.05    98.20±1.22    99.67±0.33
         50            98.94±1.11    99.29±0.94    99.82±0.60    99.47±0.59    99.25±5.79    99.91±0.27
BA       10            45.09±3.09    48.37±1.98    25.32±1.14    20.10±2.51    55.05±1.67    46.62±1.98
         30            62.74±0.92    63.31±1.03    44.16±1.03    43.84±1.54    68.84±1.09    68.99±0.93
         50            68.30±1.31    68.45±1.32    54.10±1.55    52.49±1.27    72.20±1.44    84.67±1.06
COIL20   10            87.74±2.26    85.43±1.40    93.57±1.59    81.10±1.69    90.09±1.15    91.05±2.03
         30            95.48±1.40    87.82±1.03    96.52±0.68    87.69±1.39    95.27±0.93    97.89±2.00
         50            96.27±0.71    88.47±0.45    97.87±0.10    90.92±1.19    97.53±0.82    99.97±0.04

Table 3: Classification accuracy (%) on benchmark data sets (mean ± standard deviation). The best results are in bold font.

5.2 Comparison Methods

We compare our method with several other state-of-the-art algorithms.

• Local and Global Consistency (LGC) [Zhou et al., 2004]: LGC is a popular label propagation method. For this method, the kernel matrix is used to compute L.

• Gaussian Field and Harmonic Function (GFHF) [Zhu et al., 2003]: Different from LGC, GFHF is another mechanism to infer the unknown labels as a process of propagating labels through the pairwise similarity.

• Semi-supervised Classification with Adaptive Neighbours (SCAN) [Nie et al., 2017a]: Based on the adaptive neighbors method, SCAN shows much better performance than many other techniques.

• A Unified Optimization Framework for SSL [Li et al., 2015]: Li et al. propose a unified framework based on the self-expressiveness approach. By using low-rank and sparse regularizers, they obtain the S2LRR and S3R methods, respectively.

5.3 Performance Evaluation

We randomly choose 10%, 30%, and 50% of the samples as labeled data and repeat the experiment 20 times. Classification accuracy and deviation are shown in Table 3. More concretely, for GFHF and LGC, the seven constructed kernels are tested and the best performance is reported. Unlike them, for SCAN, S2LRR, S3R, and SMKL, label prediction and graph learning are conducted in a unified framework.

As expected, the classification accuracy of all methods monotonically increases with the percentage of labeled samples. As can be observed, our SMKL method outperforms the other state-of-the-art methods in most cases. This confirms the effectiveness of our proposed method on the SSL task.

6 Conclusion

This paper proposes a novel multiple kernel learning framework for clustering and semi-supervised classification. In this model, a more flexible kernel learning strategy is developed to enhance the representation ability of the learned optimal kernel and to assign a weight to each base kernel. An iterative algorithm is designed to solve the resultant optimization problem, so that graph construction, label learning, and kernel learning are boosted by each other. Comprehensive experimental results clearly demonstrate the superiority of our method.

Acknowledgments

This paper was in part supported by two Fundamental Research Funds for the Central Universities of China (Nos. ZYGX2017KYQD177, ZYGX2016Z003), Grants from the Natural Science Foundation of China (No. 61572111), and a 985 Project of UESTC (No. A1098531023601041).

References

[Cortes, 2009] Corinna Cortes. Can learning kernels help performance? Invited talk at the International Conference on Machine Learning (ICML 2009). Montréal, Canada, 2009.

[Du et al., 2015] Liang Du, Peng Zhou, Lei Shi, Hanmo Wang, Mingyu Fan, Wenjian Wang, and Yi-Dong Shen. Robust multiple kernel k-means using l21-norm. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3476–3482. AAAI Press, 2015.

[Fan, 1949] Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations I. Proceedings of the National Academy of Sciences, 35(11):652–655, 1949.

[Gehler and Nowozin, 2009] Peter Gehler and Sebastian Nowozin. On feature combination for multiclass object classification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 221–228. IEEE, 2009.

[Gönen and Alpaydin, 2008] Mehmet Gönen and Ethem Alpaydin. Localized multiple kernel learning. In Proceedings of the 25th International Conference on Machine Learning, pages 352–359. ACM, 2008.


[Hofmann et al., 2008] Thomas Hofmann, Bernhard Schölkopf, and Alexander J Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.

[Huang et al., 2012a] Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Affinity aggregation for spectral clustering. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 773–780. IEEE, 2012.

[Huang et al., 2012b] Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Multiple kernel fuzzy clustering. IEEE Transactions on Fuzzy Systems, 20(1):120–134, 2012.

[Huang et al., 2015] Jin Huang, Feiping Nie, and Heng Huang. A new simplex sparse learning model to measure data similarity for clustering. In IJCAI, pages 3569–3575, 2015.

[Huang et al., 2018a] Shudong Huang, Zhao Kang, and Zenglin Xu. Self-weighted multi-view clustering with soft capped norm. Knowledge-Based Systems, 2018.

[Huang et al., 2018b] Shudong Huang, Zenglin Xu, and Jiancheng Lv. Adaptive local structure learning for document co-clustering. Knowledge-Based Systems, 148:74–84, 2018.

[Kang et al., 2017a] Zhao Kang, Chong Peng, and Qiang Cheng. Kernel-driven similarity learning. Neurocomputing, 267:210–219, 2017.

[Kang et al., 2017b] Zhao Kang, Chong Peng, and Qiang Cheng. Twin learning for similarity and clustering: A unified kernel approach. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). AAAI Press, 2017.

[Kang et al., 2018] Zhao Kang, Chong Peng, Qiang Cheng, and Zenglin Xu. Unified spectral clustering with optimal graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). AAAI Press, 2018.

[Li et al., 2015] Chun-Guang Li, Zhouchen Lin, Honggang Zhang, and Jun Guo. Learning semi-supervised representation towards a unified optimization framework for semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2767–2775, 2015.

[Liu et al., 2009] Jun Liu, Jianhui Chen, Songcan Chen, and Jieping Ye. Learning the optimal neighborhood kernel for classification. In IJCAI, pages 1144–1149, 2009.

[Liu et al., 2013] Xinwang Liu, Jianping Yin, Lei Wang, Lingqiao Liu, Jun Liu, Chenping Hou, and Jian Zhang. An adaptive approach to learning optimal neighborhood kernels. IEEE Transactions on Cybernetics, 43(1):371–384, 2013.

[Liu et al., 2017] Xinwang Liu, Sihang Zhou, Yueqing Wang, Miaomiao Li, Yong Dou, En Zhu, Jianping Yin, and Han Li. Optimal neighborhood kernel clustering with multiple kernels. In AAAI, pages 2266–2272, 2017.

[Mohar et al., 1991] Bojan Mohar, Y Alavi, G Chartrand, and OR Oellermann. The Laplacian spectrum of graphs. Graph Theory, Combinatorics, and Applications, 2(871-898):12, 1991.

[Ng et al., 2002] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002.

[Nie et al., 2014] Feiping Nie, Xiaoqian Wang, and Heng Huang. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 977–986. ACM, 2014.

[Nie et al., 2017a] Feiping Nie, Guohao Cai, and Xuelong Li. Multi-view clustering and semi-supervised classification with adaptive neighbours. In AAAI, pages 2408–2414, 2017.

[Nie et al., 2017b] Feiping Nie, Jing Li, and Xuelong Li. Self-weighted multiview clustering with multiple graphs. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 2564–2570, 2017.

[Peng et al., 2016] Xi Peng, Shijie Xiao, Jiashi Feng, Wei-Yun Yau, and Zhang Yi. Deep subspace clustering with sparsity prior. In IJCAI, pages 1925–1931, 2016.

[Xu et al., 2009] Zenglin Xu, Rong Jin, Irwin King, and Michael Lyu. An extended level method for efficient multiple kernel learning. In Advances in Neural Information Processing Systems, pages 1825–1832, 2009.

[Xu et al., 2010] Zenglin Xu, Rong Jin, Haiqin Yang, Irwin King, and Michael R Lyu. Simple and efficient multiple kernel learning by group lasso. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1175–1182. Citeseer, 2010.

[Yang et al., 2017] Yang Yang, Fumin Shen, Zi Huang, Heng Tao Shen, and Xuelong Li. Discrete nonnegative spectral clustering. IEEE Transactions on Knowledge and Data Engineering, 29(9):1834–1845, 2017.

[Yu et al., 2010] Shi Yu, Tillmann Falck, Anneleen Daemen, Leon-Charles Tranchevent, Johan AK Suykens, Bart De Moor, and Yves Moreau. L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioinformatics, 11(1):309, 2010.

[Zhou et al., 2004] Denny Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems, pages 321–328, 2004.

[Zhu et al., 2003] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.
