
Self-Paced Clustering Ensemble

Peng Zhou, Liang Du, Member, IEEE, Xinwang Liu, Yi-Dong Shen, Mingyu Fan, and Xuejun Li

Abstract— The clustering ensemble has emerged as an important extension of the classical clustering problem. It provides an elegant framework to integrate multiple weak base clusterings to generate a strong consensus result. Most existing clustering ensemble methods usually exploit all data to learn a consensus clustering result, which does not sufficiently consider the adverse effects caused by some difficult instances. To handle this problem, we propose a novel self-paced clustering ensemble (SPCE) method, which gradually involves instances from easy to difficult ones into the ensemble learning. In our method, we integrate the evaluation of the difficulty of instances and ensemble learning into a unified framework, which can automatically estimate the difficulty of instances and ensemble the base clusterings. To optimize the corresponding objective function, we propose a joint learning algorithm to obtain the final consensus clustering result. Experimental results on benchmark data sets demonstrate the effectiveness of our method.

Index Terms— Clustering ensemble, consensus learning, self-paced learning.

Manuscript received June 23, 2019; revised December 5, 2019 and March 15, 2020; accepted March 28, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61806003, Grant 61976129, Grant 61922088, Grant 61976205, Grant 61772373, and Grant 61972001 and in part by the Key Natural Science Project of the Anhui Provincial Education Department under Grant KJ2018A0010. (Corresponding author: Yi-Dong Shen.)

Peng Zhou is with the School of Computer Science and Technology, Anhui University, Hefei 230601, China, and also with the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).
Liang Du is with the School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China (e-mail: [email protected]).
Xinwang Liu is with the College of Computer, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]).
Yi-Dong Shen is with the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).
Mingyu Fan is with the College of Maths and Information Science, Wenzhou University, Wenzhou 325035, China (e-mail: [email protected]).
Xuejun Li is with the School of Computer Science and Technology, Anhui University, Hefei 230601, China (e-mail: [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2020.2984814

I. INTRODUCTION

CLUSTERING is a fundamental unsupervised problem in machine learning tasks. It has been widely used in various applications and demonstrated promising performance. However, according to [1], conventional single clustering algorithms usually suffer from the following problems: 1) given a data set, different structures may be discovered by various clustering methods due to their different objective functions; 2) for a single clustering method, since no ground truth is available, it could be hard to validate the clustering results; and 3) some methods, e.g., k-means, highly depend on their initializations. To address these problems, the idea of a clustering ensemble has been proposed.

The clustering ensemble provides an elegant framework for combining multiple weak base clusterings of a data set to generate a consensus clustering [2]. In recent years, many clustering ensemble methods have been proposed [3]–[7]. For example, Strehl et al. and Topchy et al. proposed information theoretic-based clustering ensemble methods in [3] and [4], respectively; Fern et al. extended the graph cut method to the clustering ensemble [8]; and Ren et al. proposed a weighted-object graph partitioning algorithm for the clustering ensemble [9]. These methods try to learn the consensus clustering result from all instances by taking advantage of the diversity between base clusterings and reducing the redundancy in the clustering ensemble. However, since the base clustering results may not be entirely reliable, it is inappropriate to always use all data for the clustering ensemble. Intuitively, some instances are difficult for clustering or are even outliers, which leads to the poor performance of the base clusterings. At the beginning of learning, these difficult instances may mislead the model because the early model may not have the ability to handle them.

To tackle this problem, we ensemble the base clusterings in a curriculum learning framework. Curriculum learning was proposed by Bengio et al. [10]; it incrementally involves instances (from easy to difficult ones) into learning. The key idea is that, in the beginning, the model is relatively weak, and thus, it needs some easy instances for training. Then, the ability of the model becomes increasingly strong as time goes on so that it can handle more and more difficult instances. Finally, it is strong enough to handle almost all instances. To formulate this key idea of curriculum learning, we propose a novel self-paced clustering ensemble (SPCE) method, which can automatically evaluate the difficulty of instances and gradually include instances from easy to difficult ones into the ensemble learning.

In our method, we estimate the difficulty of instances with the agreement of the base clustering results, i.e., if many base clustering results agree with each other on some instances, these instances may be easy for clustering. We adapt this idea to the ensemble method and propose a self-paced learning method that evaluates the difficulty of instances automatically in the process of the ensemble. On the one hand, easy instances can be helpful to ensemble learning; on the other hand, with the learning process, more and more instances become easy for learning. Since the clustering result represents the relation between two instances, i.e., it indicates whether two instances belong to the same cluster or not, we transform all base clustering results into connective matrices and try to learn a consensus connective matrix from them.


We use a weight matrix to represent the difficulty of all pairs in the connective matrix, i.e., the larger the weight of a pair is, the easier it is to decide whether the two instances belong to the same cluster. Then, we integrate the weight matrix learning and the consensus connective matrix learning into a unified objective function. To optimize this objective function, we provide a block coordinate descent schema that can jointly learn the consensus connective matrix and the weight matrix. Extensive experiments are conducted on benchmark data sets, and the results demonstrate the effectiveness of our self-paced learning method.

This article is organized as follows. Section II describes some related work. Section III presents in detail the main algorithm of our method. Section IV shows the experimental results, and Section V concludes this article.

II. RELATED WORK

In this section, we first present the basic notations and then introduce some related works. Throughout this article, we use boldface uppercase and lowercase letters to denote matrices and vectors, respectively. The $(i, j)$th element of a matrix $M$ is denoted as $M_{ij}$, and the $i$th element of a vector $v$ is denoted as $v_i$. Given a matrix $M \in \mathbb{R}^{n \times d}$, we use $\|M\|_F = (\sum_{i=1}^{n}\sum_{j=1}^{d} M_{ij}^2)^{1/2}$ to denote its Frobenius norm. We use $\|M\|_0$ to denote its $\ell_0$-norm, which is the number of nonzero elements in $M$. Since the $\ell_0$-norm is nonconvex and discontinuous, the $\ell_1$-norm is often used as an approximation of the $\ell_0$-norm. The $\ell_1$-norm of $M$ is defined as $\|M\|_1 = \sum_{i=1}^{n}\sum_{j=1}^{d} |M_{ij}|$.

A. Clustering Ensemble

Ensemble learning trains multiple learners and tries to combine their predictions to achieve better learning performance [11]. Since the generalization ability of the ensemble method could be better than the base learners [12], ensemble learning has been applied to various domains, such as image analysis [13], [14], medical diagnosis [15], and multiview data analysis [16]–[20]. In the early stages, many ensemble methods were designed for supervised learning, in which the labels of training data were known. For example, Freund and Schapire [21] proposed the famous AdaBoost method that evaluated the base learners and then applied the evaluation results to weight each base learner and change the training data distribution; Friedman [22] proposed the gradient boosting decision tree method that ensembled the results of multiple decision trees. In these methods, the labels of training data are necessary for eliminating the ambiguity when combining the base learners [5].

However, in unsupervised learning, due to the lack of training labels, it is more challenging to design ensemble methods. Moreover, as introduced earlier, conventional single clustering methods often suffer from stability and robustness problems. Therefore, the clustering ensemble has attracted increasing attention in recent years. In the early stages, some information theoretic-based methods were proposed. For example, Strehl and Ghosh [3] first introduced the clustering ensemble task and formalized clustering ensemble as a combinatorial optimization problem in terms of shared mutual information; then, Topchy et al. [4] combined clusterings based on the observation that the consensus function of the clustering ensemble is related to the classical intraclass variance criterion using the generalized mutual information definition.

In this article, we follow the problem setting of the clustering ensemble defined in [3] and [4]. In more detail, let $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ be a data set of $n$ data points. Suppose that we are given a set of $m$ clusterings $\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$ of the data in $\mathcal{X}$, each clustering $C_i$ consisting of a set of clusters $\{\pi^i_1, \pi^i_2, \ldots, \pi^i_k\}$, where $k$ is the number of clusters in $C_i$ and $\mathcal{X} = \bigcup_{j=1}^{k} \pi^i_j$. Note that the number of clusters $k$ could be different for different clusterings. According to [2]–[4], the goal of the clustering ensemble is to learn a consensus partition of the data set from the $m$ base clusterings $C_1, \ldots, C_m$.

In recent years, to learn the consensus partition, more and more techniques have been applied to ensemble base clustering results. For example, Zhou and Tang [5] proposed an alignment method to combine multiple k-means clustering results. Some works applied the famous matrix factorization to the clustering ensemble. For instance, Li et al. [23] and Li and Ding [24] factorized the connective matrix into two indicator matrices by symmetric nonnegative matrix factorization. Besides k-means and matrix factorization, spectral clustering was also extended to clustering ensemble tasks, such as in [25]–[27]. Some methods introduced a probabilistic graphical model into the clustering ensemble. For example, Wang et al. [28] applied a Bayesian method to the clustering ensemble; Huang et al. [29] learned a consensus clustering result with a factor graph. Since the clustering diversity and quality are essential in ensemble learning, many methods made full use of the diversity and quality to combine base clusterings. For example, Abbasi et al. [30] proposed a new stability measure called edited normalized mutual information (NMI) and used it to ensemble base clusterings; Bagherinia et al. [31] provided a fuzzy clustering ensemble by considering the diversity and quality of base clusterings.

Besides these works that ensembled all base clustering results, some works tried to select informative and nonredundant base clustering results for the ensemble. For example, Azimi and Fern [32] proposed an adaptive clustering ensemble selection method to select the base results; Hong et al. [33] selected base clusterings by a resampling method; Parvin and Minaei-Bidgoli [34], [35] proposed a weighted locally adaptive clustering for clustering ensemble selection; Yu et al. [36] transferred the clustering selection to feature selection and designed a hybrid strategy to select base results; Zhao et al. [37] proposed internal validity indices for clustering ensemble selection; and Shi et al. [38] extended transfer learning to the clustering ensemble, leading to a transfer clustering ensemble selection method.

In this article, we will propose a clustering ensemble method based on connective matrices. Since the clustering result represents the relation between two instances as introduced earlier, from $\mathcal{C}$, following [24], [39]–[42], we can construct the connective matrix $S^{(i)} \in \mathbb{R}^{n \times n}$ for partition $C_i$ as

$$S^{(i)}_{pq} = \begin{cases} 1, & \text{if } x_p \text{ and } x_q \text{ belong to the same cluster,} \\ 0, & \text{otherwise.} \end{cases}$$
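As a concrete illustration of this construction, the short NumPy sketch below (our own helper names, not from the article) builds one connective matrix from a base clustering given as a label vector:

```python
import numpy as np

def connective_matrix(labels):
    """Build the connective matrix S^(i) of one base clustering.

    S_pq = 1 if x_p and x_q receive the same cluster label, and 0 otherwise.
    """
    labels = np.asarray(labels)
    # Pairwise label comparison via broadcasting gives an n x n 0/1 matrix.
    return (labels[:, None] == labels[None, :]).astype(float)

# Example: connective matrices of two base clusterings of five points.
S_list = [connective_matrix(y) for y in ([0, 0, 1, 1, 2], [1, 1, 1, 0, 0])]
```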



The target of our clustering ensemble method is to learn a consensus matrix $S$ from $S^{(1)}, S^{(2)}, \ldots, S^{(m)}$ and then obtain the final clustering result from the consensus matrix $S$. Traditional connective matrix-based methods [24], [39]–[42] construct a coassociation matrix by linearly combining all connective matrices and then obtain the consensus clustering result from it. Different from them, which use all instances for the ensemble, we ensemble the base clusterings in a curriculum learning framework, which incrementally involves instances (from easy to difficult ones) into ensemble learning.

TABLE I
NOTATIONS AND DESCRIPTIONS USED IN OUR METHOD

B. Self-Paced Learning

Inspired by the learning process of humans, Bengio et al. [10] proposed the notion of curriculum learning. The idea is to incrementally involve instances into learning, where easy instances are involved first and harder ones are then introduced gradually. One benefit of this strategy is that it helps alleviate the local optimum problem in nonconvex optimization, as introduced in [43] and [44].

To formulate the key principle of curriculum learning that gradually includes instances from easy to difficult ones, Kumar et al. [45] proposed self-paced learning. More formally, given a data set $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ containing $n$ instances, where $x_i \in \mathbb{R}^d$ is the $d$-dimensional feature vector of the $i$th instance and $y_i$ is its label, $L(y_i, g(x_i, \theta))$ is denoted as the loss function that indicates the cost between the ground truth $y_i$ and the estimated label $g(x_i, \theta)$, where $\theta$ represents the model parameters in the decision function $g$. According to [46] and [47], self-paced learning introduces a weighted loss term on all instances and a general regularizer term on the instance weights, which is in the following form:

$$\min_{\theta, \mathbf{w}} \sum_{i=1}^{n} \bigl( w_i L(y_i, g(x_i, \theta)) + f(w_i, \lambda) \bigr) \qquad (1)$$

where $\lambda$ is an age parameter for controlling the learning pace, $w_i$ is the weight of the $i$th instance, and $f(w_i, \lambda)$ is the self-paced regularizer. When fixing $\theta$, supposing that $l_i = L(y_i, g(x_i, \theta))$ and $w_i^*(\lambda, l_i)$ is the optimum weight of the $i$th instance, which depends on $\lambda$ and $l_i$, $f(w_i, \lambda)$ should be chosen so that $w_i^*(\lambda, l_i)$ is monotonically decreasing with $l_i$ and increasing with $\lambda$, as suggested in [48]–[50]. Since $w_i^*(\lambda, l_i)$ is monotonically decreasing with $l_i$, easy instances, which have low loss, will have large weight, which means that we learn from these instances first. That $w_i^*(\lambda, l_i)$ is monotonically increasing with $\lambda$ indicates that, as the learning process proceeds ($\lambda$ grows), more and more instances are used for learning. Therefore, the process of self-paced learning is to optimize (1) via alternating minimization. Fixing $\theta$ and solving $\mathbf{w}$ learns the weight of each instance; fixing $\mathbf{w}$ and solving $\theta$ learns the model using the easy instances.

Due to the promising performance, self-paced learning has been applied to handle many machine learning tasks, such as multitask learning [51], [52] and robust classification [53]. In this article, we will extend this self-paced learning framework to unsupervised ensemble learning.

III. SELF-PACED CLUSTERING ENSEMBLE

In this section, we provide the framework of our SPCE method. The main notations and their descriptions used in this section are shown in Table I.

A. Mining the Most Certain Information

As we know, self-paced learning gradually incorporates easy to more complex samples into training. In our task, since we handle $m$ connective matrices $S^{(1)}, S^{(2)}, \ldots, S^{(m)}$, the samples are the data pairs appearing in $S^{(i)}$. To apply self-paced learning, we first find the easiest, or the most certain, data pairs. Here, a voting method is used to find the most certain pairs. Given any instance pair $x_p$ and $x_q$, if all $m$ connective matrices agree that $x_p$ and $x_q$ belong or do not belong to the same cluster, we regard this pair as a most certain pair. More formally, we compute the coassociation matrix $\hat{S}$ as

$$\hat{S} = \frac{1}{m} \sum_{i=1}^{m} S^{(i)}. \qquad (2)$$

It is easy to verify that for any $1 \le p, q \le n$, we have $0 \le \hat{S}_{pq} \le 1$. $\hat{S}_{pq} = 1$ indicates that all connective matrices agree that $x_p$ and $x_q$ belong to the same cluster, and $\hat{S}_{pq} = 0$ indicates that all connective matrices agree that they belong to different clusters. Thus, the most certain pairs are the elements in $\hat{S}$ that are either 1 or 0. To make the learned consensus matrix $S$ preserve such certain information, we directly set $S$ as

$$S_{pq} = \begin{cases} 1, & \text{if } \hat{S}_{pq} = 1, \\ 0, & \text{if } \hat{S}_{pq} = 0, \\ \text{missing}, & \text{otherwise.} \end{cases}$$

Therefore, the consensus matrix $S$ can be learned by solving a matrix completion problem, i.e., we fill the missing values in $S$ by self-paced consensus learning.
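A minimal sketch of this mining step, assuming the base connective matrices are stored as NumPy arrays (the function and variable names are ours); uncertain entries of the initial consensus matrix are marked as missing with NaN:

```python
import numpy as np

def mine_certain_pairs(S_list):
    """Coassociation matrix (Eq. (2)), certainty mask, and partially filled S.

    Entries where all base clusterings agree are fixed; the rest are left
    missing and will be filled by the self-paced consensus learning.
    """
    S_hat = np.mean(S_list, axis=0)              # entries of S_hat lie in [0, 1]
    certain = (S_hat == 0.0) | (S_hat == 1.0)    # indicator of the certain pairs
    S = np.full_like(S_hat, np.nan)
    S[certain] = S_hat[certain]                  # preserve the certain values
    return S_hat, certain, S
```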



B. Self-Paced Consensus Learning

Since $S$ is the consensus matrix, we wish to minimize the disagreement between it and all connective matrices. A natural idea is to minimize $\sum_{i=1}^{m} \|S - S^{(i)}\|_F^2$. However, this objective treats all $m$ clustering results equally, which may be inappropriate. Intuitively, the quality of each clustering result is different, and we wish the better clustering results to contribute more to the consensus learning. Thus, we can modify the objective to $\sum_{i=1}^{m} \alpha_i \|S - S^{(i)}\|_F^2$, where $\alpha_i$ is the weight of the $i$th base clustering result. Next, we should decide the weight of each clustering result. Inspired by the autoweighted technique proposed in [54] and [55], we can define $\alpha_i = 1/\|S - S^{(i)}\|_F$, which means that the closer $S^{(i)}$ is to the consensus matrix, the larger the weight of the $i$th clustering result is. Thus, we obtain the following objective function:

$$\min_{S} \sum_{i=1}^{m} \|S - S^{(i)}\|_F \quad \text{s.t. } S \odot \Omega = \hat{S} \odot \Omega,\; 0 \le S_{pq} \le 1 \;\; \forall p, q \qquad (3)$$

where $\Omega \in \mathbb{R}^{n \times n}$ is an indicator matrix whose element $\Omega_{pq} = 1$ if $\hat{S}_{pq} = 0$ or $1$ and $\Omega_{pq} = 0$ otherwise, and $\odot$ is the Hadamard product, i.e., the elementwise product of two matrices. The constraint makes sure that the consensus matrix $S$ preserves the certain information.

To modify (3) into the self-paced learning framework, we should decide which data pairs are easy samples. Here, we follow the idea of voting introduced before. Given a data pair $x_p$ and $x_q$, if most $S^{(i)}$ agree with each other, we believe that this pair is an easy pair. More formally, the smaller $\sum_{i=1}^{m} (S_{pq} - S^{(i)}_{pq})^2$ is, the easier the pair $(x_p, x_q)$ is. To this end, we introduce a weight matrix $W$ whose element $0 \le W_{pq} \le 1$ indicates the weight of the pair $(x_p, x_q)$. The larger $W_{pq}$ is, the easier this pair is. Then, following the self-paced learning framework, we set $f(\mathbf{w}, \lambda)$ in (1) as $f(W, \lambda) = -\lambda \|W\|_1$ and obtain the following formula:

$$\begin{aligned} \min_{S, W}\;& \sum_{i=1}^{m} \|(S - S^{(i)}) \odot W\|_F - \lambda \|W\|_1 \\ \text{s.t. }& S \odot \Omega = \hat{S} \odot \Omega,\; 0 \le S_{pq} \le 1 \;\; \forall p, q \\ & 0 \le W_{pq} \le 1 \;\; \forall p, q \end{aligned} \qquad (4)$$

where $\lambda$ is the age parameter and becomes increasingly larger in the process of optimization.

C. Overall Objective Function

Equation (4) provides a self-paced framework to learn the consensus matrix, and now, we need to transform it into the final clustering result. Suppose that we want to partition the instances into $c$ clusters; the easiest way is to make $S$ contain just $c$ connected components. Note that we obtain the clustering result by finding the $c$ connected components without any uncertain discretization procedures, such as k-means.

Let $L$ be the Laplacian matrix of $S$, i.e., $L = D - (S + S^T)/2$, where $D$ is a diagonal matrix whose $p$th diagonal element is $D_{pp} = \sum_{q} (S_{pq} + S_{qp})/2$. If $S$ is nonnegative, then the Laplacian matrix $L$ has an important property as follows.

Theorem 1 [56]: The multiplicity $c$ of the eigenvalue 0 of the Laplacian matrix $L$ is equal to the number of connected components in the consensus matrix $S$.

Theorem 1 indicates that if $\operatorname{rank}(L) = n - c$, where $\operatorname{rank}(L)$ denotes the rank of matrix $L$, then we already partition the instances into $c$ clusters based on $S$ without any discretization procedures, such as k-means. Motivated by this theorem, we add the constraint $\operatorname{rank}(L) = n - c$ to the objective function

$$\begin{aligned} \min_{S, W}\;& \sum_{i=1}^{m} \|(S - S^{(i)}) \odot W\|_F - \lambda \|W\|_1 \\ \text{s.t. }& S \odot \Omega = \hat{S} \odot \Omega,\; 0 \le S_{pq} \le 1 \;\; \forall p, q,\; \operatorname{rank}(L) = n - c \\ & 0 \le W_{pq} \le 1 \;\; \forall p, q. \end{aligned} \qquad (5)$$

Last but not least, to obtain a clearer clustering structure, we wish the consensus matrix $S$ to be as sparse as possible so that $S$ can represent a clear graph structure. To achieve this, we impose a sparse regularization term $\|S\|_0$ on the objective function and obtain the following formula:

$$\begin{aligned} \min_{S, W}\;& \sum_{i=1}^{m} \|(S - S^{(i)}) \odot W\|_F - \lambda \|W\|_1 + \gamma \|S\|_0 \\ \text{s.t. }& S \odot \Omega = \hat{S} \odot \Omega,\; 0 \le S_{pq} \le 1 \;\; \forall p, q,\; \operatorname{rank}(L) = n - c \\ & 0 \le W_{pq} \le 1 \;\; \forall p, q \end{aligned} \qquad (6)$$

where $\gamma$ is a balancing hyperparameter that can adjust the sparsity of $S$. Note that we use the $\ell_0$-norm here instead of the $\ell_1$-norm or any other convex or nonconvex approximation, which can make $S$ as sparse as possible.
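To make Theorem 1 and the role of the rank constraint concrete, the following sketch (SciPy assumed; the helper name is ours) reads the final clusters directly off a learned consensus matrix as the connected components of its graph, without any k-means step:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def clusters_from_consensus(S, tol=1e-8):
    """Return the cluster labels given by the connected components of S.

    When rank(L) = n - c holds, the symmetrized graph of S has exactly
    c connected components, which are the final clusters.
    """
    A = (S + S.T) / 2.0
    A = np.where(A < tol, 0.0, A)            # drop numerically zero edges
    n_comp, labels = connected_components(csr_matrix(A), directed=False)
    return n_comp, labels
```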



Since (6) involves the Frobenius norm and the rank function, which are difficult to optimize, we relax (6) to simplify the optimization.

First, we handle the Frobenius norm. By introducing auxiliary weight variables $\alpha$, where $0 \le \alpha_i \le 1$ and $\sum_{i=1}^{m} \alpha_i = 1$, we have the following theorem.

Theorem 2: $\min_{S, W} \sum_{i=1}^{m} \|(S - S^{(i)}) \odot W\|_F$ is equivalent to $\min_{S, W, \alpha} \sum_{i=1}^{m} (1/\alpha_i)\|(S - S^{(i)}) \odot W\|_F^2$.

Proof: Let $S^*$, $W^*$, and $\alpha^*$ denote the optima of $\min_{S, W, \alpha} \sum_{i=1}^{m} (1/\alpha_i)\|(S - S^{(i)}) \odot W\|_F^2$. To prove that $S^*$ and $W^*$ are also the optima of $\min_{S, W} \sum_{i=1}^{m} \|(S - S^{(i)}) \odot W\|_F$, we need to prove that, given any $\tilde{S}$ and $\tilde{W}$, we have

$$\sum_{i=1}^{m} \|(S^* - S^{(i)}) \odot W^*\|_F \le \sum_{i=1}^{m} \|(\tilde{S} - S^{(i)}) \odot \tilde{W}\|_F.$$

Consider that

$$\begin{aligned} \Bigl( \sum_{i=1}^{m} \|(S^* - S^{(i)}) \odot W^*\|_F \Bigr)^2 &\le \Bigl( \sum_{i=1}^{m} \frac{1}{\alpha^*_i} \|(S^* - S^{(i)}) \odot W^*\|_F^2 \Bigr) \Bigl( \sum_{i=1}^{m} \alpha^*_i \Bigr) \\ &= \sum_{i=1}^{m} \frac{1}{\alpha^*_i} \|(S^* - S^{(i)}) \odot W^*\|_F^2 \\ &\le \sum_{i=1}^{m} \frac{1}{\tilde{\alpha}_i} \|(\tilde{S} - S^{(i)}) \odot \tilde{W}\|_F^2. \end{aligned} \qquad (7)$$

The first inequality is due to the Cauchy–Schwarz inequality, the second equality is due to $\sum_{i=1}^{m} \alpha^*_i = 1$, and the third inequality holds because, since $S^*$, $W^*$, and $\alpha^*$ are the optima of $\min_{S, W, \alpha} \sum_{i=1}^{m} (1/\alpha_i)\|(S - S^{(i)}) \odot W\|_F^2$, given any $\tilde{\alpha}_i$, $\tilde{S}$, and $\tilde{W}$, we have $\sum_{i=1}^{m} (1/\alpha^*_i)\|(S^* - S^{(i)}) \odot W^*\|_F^2 \le \sum_{i=1}^{m} (1/\tilde{\alpha}_i)\|(\tilde{S} - S^{(i)}) \odot \tilde{W}\|_F^2$.

Note that $\tilde{\alpha}_i$ can take any value that satisfies $\sum_{i=1}^{m} \tilde{\alpha}_i = 1$ in (7). Specially, we set $\tilde{\alpha}_i = \|(\tilde{S} - S^{(i)}) \odot \tilde{W}\|_F / \sum_{i=1}^{m} \|(\tilde{S} - S^{(i)}) \odot \tilde{W}\|_F$. Then, we have

$$\Bigl( \sum_{i=1}^{m} \|(S^* - S^{(i)}) \odot W^*\|_F \Bigr)^2 \le \sum_{i=1}^{m} \frac{1}{\tilde{\alpha}_i} \|(\tilde{S} - S^{(i)}) \odot \tilde{W}\|_F^2 = \Bigl( \sum_{i=1}^{m} \|(\tilde{S} - S^{(i)}) \odot \tilde{W}\|_F \Bigr)^2 \qquad (8)$$

which leads to

$$\sum_{i=1}^{m} \|(S^* - S^{(i)}) \odot W^*\|_F \le \sum_{i=1}^{m} \|(\tilde{S} - S^{(i)}) \odot \tilde{W}\|_F.$$

This completes the proof.

According to Theorem 2, we relax (6) as

$$\begin{aligned} \min_{S, W, \alpha}\;& \sum_{i=1}^{m} \frac{1}{\alpha_i}\|(S - S^{(i)}) \odot W\|_F^2 - \lambda \|W\|_1 + \gamma \|S\|_0 \\ \text{s.t. }& S \odot \Omega = \hat{S} \odot \Omega,\; 0 \le S_{pq} \le 1 \;\; \forall p, q,\; \operatorname{rank}(L) = n - c \\ & 0 \le W_{pq} \le 1 \;\; \forall p, q,\; 0 \le \alpha_i \le 1,\; \sum_{i=1}^{m} \alpha_i = 1. \end{aligned} \qquad (9)$$

Then, we handle the rank function. According to [56], since we wish the rank of $L$ to be $n - c$, i.e., the $c$ smallest eigenvalues of $L$ to be 0s, we try to minimize $\sum_{i=1}^{c} \sigma_i(L)$, where $\sigma_i(L)$ denotes the $i$th smallest eigenvalue of $L$. According to Ky Fan's theorem [57], we have $\sum_{i=1}^{c} \sigma_i(L) = \min_{Y \in \mathbb{R}^{n \times c},\, Y^T Y = I} \operatorname{tr}(Y^T L Y)$. Thus, by introducing an orthogonal matrix $Y \in \mathbb{R}^{n \times c}$, we can reformulate (9) as follows:

$$\begin{aligned} \min_{S, W, \alpha, Y}\;& \sum_{i=1}^{m} \frac{1}{\alpha_i}\|(S - S^{(i)}) \odot W\|_F^2 - \lambda \|W\|_1 + \gamma \|S\|_0 + \rho \operatorname{tr}(Y^T L Y) \\ \text{s.t. }& S \odot \Omega = \hat{S} \odot \Omega,\; 0 \le S_{pq} \le 1 \;\; \forall p, q,\; 0 \le W_{pq} \le 1 \;\; \forall p, q \\ & 0 \le \alpha_i \le 1,\; \sum_{i=1}^{m} \alpha_i = 1,\; Y^T Y = I \end{aligned} \qquad (10)$$

where $\rho$ is a large enough parameter to make sure that the rank of $L$ is $n - c$. Note that, different from some conventional ensemble methods that linearly combine all base connective matrices, our $S$ is more like a nonparametric consensus matrix, which is more flexible and can effectively enlarge the region from which an optimal consensus matrix can be chosen. Similar results can also be found in some previous research, such as in [58]–[60].

D. Optimization

Now, we introduce how to optimize (10). Since (10) involves four groups of variables, we use a block coordinate descent schema to optimize it. More specifically, we optimize one group of variables while fixing the other variables and repeat this procedure until it converges.

1) Optimize W: We remove the terms that are irrelevant to $W$ and obtain

$$\min_{W} \sum_{i=1}^{m} \frac{1}{\alpha_i}\|A^{(i)} \odot W\|_F^2 - \lambda \|W\|_1 \quad \text{s.t. } 0 \le W_{pq} \le 1 \;\; \forall p, q \qquad (11)$$

where $A^{(i)} = S - S^{(i)}$.

It is easy to verify that (11) can be decoupled into $n \times n$ independent subproblems. Consider one of them:

$$\min_{W_{pq}} B_{pq} W_{pq}^2 - \lambda W_{pq} \quad \text{s.t. } 0 \le W_{pq} \le 1 \qquad (12)$$

where $B_{pq} = \sum_{i=1}^{m} (A^{(i)}_{pq})^2 / \alpha_i$.

Setting the partial derivative of (12) with respect to $W_{pq}$ to zero, we obtain $W_{pq} = \lambda/(2B_{pq})$. Since $B_{pq} \ge 0$, $W_{pq} \ge 0$. If $\lambda/(2B_{pq}) > 1$, then $B_{pq} W_{pq}^2 - \lambda W_{pq}$ decreases monotonically in the range $[0, 1]$, so the optimum is 1. To sum up, we optimize $W_{pq}$ as

$$W_{pq} = \min\left( \frac{\lambda}{2B_{pq}}, 1 \right). \qquad (13)$$

From (13), we see that $\lambda$ corresponds to the "age" of the model. When $\lambda$ is small, the weights of most samples are small, and only easy samples with small losses ($B_{pq}$) influence the model much. As $\lambda$ grows, more samples with large losses will gradually influence the model. This accords with the motivation of self-paced learning.
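As an illustration of the closed-form update (13), a short NumPy sketch (the function and variable names are ours, not from the article):

```python
import numpy as np

def update_W(S, S_list, alpha, lam):
    """Pairwise weights from Eq. (13): W_pq = min(lambda / (2 B_pq), 1)."""
    A = np.stack([S - Si for Si in S_list])            # residuals A^(i), shape m x n x n
    B = np.einsum('ipq,i->pq', A ** 2, 1.0 / alpha)    # B_pq = sum_i (A^(i)_pq)^2 / alpha_i
    with np.errstate(divide='ignore'):
        W = np.minimum(lam / (2.0 * B), 1.0)           # pairs with B_pq = 0 get weight 1
    return W
```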



2) Optimize S: When the other variables are fixed, we rewrite (10) as

$$\begin{aligned} \min_{S}\;& \sum_{i=1}^{m} \frac{1}{\alpha_i}\|(S - S^{(i)}) \odot W\|_F^2 + \gamma \|S\|_0 + \rho \sum_{p,q=1}^{n} \|y_p - y_q\|_2^2\, S_{pq} \\ \text{s.t. }& S \odot \Omega = \hat{S} \odot \Omega,\; 0 \le S_{pq} \le 1 \;\; \forall p, q \end{aligned} \qquad (14)$$

where $y_p$ and $y_q$ are the $p$th and $q$th row vectors of $Y$, respectively.

Define a function $g(x)$ where $g(x) = 1$ if $x \ne 0$ and $g(x) = 0$ otherwise. Equation (14) can also be decoupled into $n \times n$ independent subproblems. Since $S \odot \Omega = \hat{S} \odot \Omega$, we just need to consider the $S_{pq}$ whose $\Omega_{pq} = 0$:

$$\min_{S_{pq}} \sum_{i=1}^{m} \frac{1}{\alpha_i}(S_{pq} - S^{(i)}_{pq})^2 W_{pq}^2 + \gamma\, g(S_{pq}) + \rho \|y_p - y_q\|_2^2\, S_{pq} \quad \text{s.t. } 0 \le S_{pq} \le 1. \qquad (15)$$

Equation (15) can be simplified further as

$$\min_{S_{pq}} (S_{pq} - C_{pq})^2 + \tau\, g(S_{pq}) \quad \text{s.t. } 0 \le S_{pq} \le 1 \qquad (16)$$

where

$$C_{pq} = \frac{\sum_{i=1}^{m} S^{(i)}_{pq}/\alpha_i - \rho \|y_p - y_q\|_2^2 / (2 W_{pq}^2)}{\sum_{i=1}^{m} 1/\alpha_i} \quad \text{and} \quad \tau = \frac{\gamma}{\sum_{i=1}^{m} W_{pq}^2/\alpha_i}.$$

Equation (16) has a closed-form solution

$$S_{pq} = \begin{cases} 1, & \text{if } C_{pq} \ge 1, \\ C_{pq}, & \text{if } \sqrt{\tau} \le C_{pq} < 1, \\ 0, & \text{if } C_{pq} < \sqrt{\tau}. \end{cases} \qquad (17)$$

3) Optimize Y: When optimizing $Y$, we have

$$\min_{Y} \operatorname{tr}(Y^T L Y) \quad \text{s.t. } Y^T Y = I. \qquad (18)$$

According to Ky Fan's theorem [57], the solution $Y$ is formed by the $c$ eigenvectors of $L$ corresponding to the $c$ smallest eigenvalues.

4) Optimize α: Let $d_i$ denote $\|(S - S^{(i)}) \odot W\|_F^2$, and we have

$$\min_{\alpha} \sum_{i=1}^{m} \frac{d_i}{\alpha_i} \quad \text{s.t. } 0 \le \alpha_i \le 1,\; \sum_{i=1}^{m} \alpha_i = 1. \qquad (19)$$

According to the Cauchy–Schwarz inequality, we have

$$\sum_{i=1}^{m} \frac{d_i}{\alpha_i} = \left( \sum_{i=1}^{m} \frac{d_i}{\alpha_i} \right) \left( \sum_{i=1}^{m} \alpha_i \right) \ge \left( \sum_{i=1}^{m} \sqrt{d_i} \right)^2. \qquad (20)$$

The equality in (20) holds when $\alpha_i \propto \sqrt{d_i}$. Thus, the closed-form solution of (19) is

$$\alpha_i = \frac{\sqrt{d_i}}{\sum_{j=1}^{m} \sqrt{d_j}}. \qquad (21)$$

E. Discussion

In this section, we first introduce how to initialize the variables involved in our objective function and then discuss how to choose the hyperparameter; at last, we discuss the relations and differences between our method and robust clustering ensemble methods.

We initialize $S = \sum_{i=1}^{m} (1/m) S^{(i)}$ and construct $L$ from $S$. Then, we initialize $Y$ by solving (18). We set $\rho = 1$ at first and adjust it automatically by observing the rank of $L$. We initialize $\alpha_i = 1/m$.

We initialize $W$ by (13). However, in (13), we need to decide $\lambda$ first. In the initialization, we have set $S$ as the mean of the $S^{(i)}$ and $\alpha_i = 1/m$, so we take a closer look at $B_{pq}$. Suppose that, for the pair $(x_p, x_q)$, $k$ clustering results agree that they should belong to the same cluster and the other $m - k$ results agree that they belong to different clusters. Then, we have

$$B_{pq} = \sum_{i=1}^{m} \frac{(S_{pq} - S^{(i)}_{pq})^2}{\alpha_i} = \left( \left(1 - \frac{k}{m}\right)^2 k + \left(\frac{k}{m}\right)^2 (m - k) \right) m. \qquad (22)$$

Letting $r$ denote $k/m$, we get $B_{pq} = ((r - 1)^2 r + r^2 (1 - r)) m^2$. Obviously, $r$ indicates how many results agree with each other. For example, $r = 0.9$ means that 90% of the results agree that the pair belongs to the same cluster. Thus, the larger $r$ is, the easier the pair is. In our method, we initialize $r = 0.9$ and set $\lambda$ as

$$\lambda = 2B_{pq} = 2((r - 1)^2 r + r^2 (1 - r)) m^2 \qquad (23)$$

which means that, for $x_p$ and $x_q$, if 90% of the clustering results agree with each other, then we set $W_{pq} = 1$, i.e., we use this pair completely.

In the following learning, we gradually increase $\lambda$ by decreasing $r$ from 0.9 to 0.5.

Then, we discuss how to choose the hyperparameter $\gamma$. As we know, $\gamma$ controls the sparsity of $S$. From (17), we find that if $C_{pq} < \sqrt{\tau}$, we have $S_{pq} = 0$, and $\tau$ is proportional to $\gamma$; thus, $\gamma$ plays the role of a threshold. More specifically, $S_{pq} = 0$ when

$$C_{pq} < \sqrt{\tau} = \sqrt{\frac{\gamma}{\sum_{i=1}^{m} W_{pq}^2/\alpha_i}} \approx \sqrt{\frac{\gamma}{\sum_{i=1}^{m} 1/\alpha_i}}. \qquad (24)$$

The approximate equality holds because, at last, almost all pairs are involved in the learning; thus, all $W_{pq} \approx 1$. Then, according to the harmonic mean inequality, we have

$$C_{pq} < \sqrt{\frac{\gamma}{\sum_{i=1}^{m} 1/\alpha_i}} \le \sqrt{\frac{\gamma \sum_{i=1}^{m} \alpha_i}{m^2}} = \frac{\sqrt{\gamma}}{m}. \qquad (25)$$

Denote $\theta = \sqrt{\gamma}/m$. $\theta$ can be viewed as a threshold, i.e., $S_{pq}$ is nonzero when $C_{pq} > \theta$. Therefore, we can easily give a threshold $\theta$ and compute $\gamma$ by $\gamma = m^2 \theta^2$ instead of directly setting $\gamma$. For example, if we wish to keep $S_{pq}$ nonzero when $C_{pq} > 0.5$, we can easily set $\theta = 0.5$ and obtain $\gamma = m^2 \theta^2 = 0.25 m^2$.
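As a small numeric illustration of these two rules, the snippet below evaluates (23) over the pace schedule and the mapping $\gamma = m^2\theta^2$; the values of m and theta are only examples, not taken from the article:

```python
# Pace parameter lambda from the agreement level r (Eq. (23)) and
# sparsity parameter gamma from the threshold theta (gamma = m^2 * theta^2).
m = 20                                          # number of base clusterings (example)

def pace_lambda(r, m):
    return 2.0 * ((r - 1.0) ** 2 * r + r ** 2 * (1.0 - r)) * m ** 2

for r in (0.9, 0.8, 0.7, 0.6, 0.5):             # r is decreased during learning
    print(f"r = {r:.1f}  ->  lambda = {pace_lambda(r, m):.1f}")   # lambda grows

theta = 0.5                                     # desired threshold on C_pq
gamma = (m * theta) ** 2                        # = 0.25 * m^2 = 100.0 here
```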


Last but not least, it is worth discussing another related line of work here, the robust clustering ensemble, which focuses on the robustness of clustering ensemble methods. It captures the noise in the data or the base clustering results and recovers clean results for the ensemble. For example, Zhou et al. [39] learned a robust consensus clustering result via minimizing the Kullback–Leibler divergence among the base results; Tao et al. [40], [61] presented robust clustering ensemble methods based on spectral clustering; Huang et al. [62] applied probability trajectories to the robust clustering ensemble; and Liu et al. [63] proposed an ensemble method on incomplete data. These methods only focus on the noises or outliers without distinguishing among the uncontaminated instances. In our method, the contaminated instances can be viewed as the most difficult instances because they may contribute nothing to the learning. Besides, the uncontaminated instances can also be handled in order of difficulty. Therefore, our method handles instances more finely. Moreover, the difficulty of instances keeps changing in the process of learning, i.e., most instances become increasingly easier as time passes. In our method, we estimate the difficulty of instances ($W$) automatically in the process of the ensemble. From (13), we observe that $W$ is proportional to $\lambda$, i.e., as time goes on ($\lambda$ increases), $W$ also increases until it reaches 1. Therefore, this property can be well characterized by our method.

TABLE II
DESCRIPTION OF THE DATA SETS

F. Whole Algorithm

Algorithm 1 summarizes the whole process of the SPCE method. Note that in the inner iteration (Lines 6–17), the algorithm optimizes $S$, $Y$, and $\alpha$. Since the solution of each step is the global optimum of the corresponding subproblem, the objective function decreases monotonically; because the objective function also has a lower bound, the iteration will always converge.

Algorithm 1 SPCE
Input: $m$ connective matrices $S^{(1)}, \ldots, S^{(m)}$, number $c$ of clusters, threshold $\theta$.
Output: Final clusters.
1: Construct $\hat{S}$ by Eq. (2) and construct $\Omega$ from $\hat{S}$.
2: Initialize the parameters as introduced in Section III-E.
3: for $r = 0.9, 0.8, \ldots, 0.5$ do
4:   Compute $\lambda$ by Eq. (23).
5:   Compute $W$ by Eq. (13).
6:   while not converged do
7:     Compute $S$ by Eq. (17).
8:     Compute $Y$ by solving Eq. (18).
9:     Compute $\alpha$ by Eq. (21).
10:    if the rank of $L$ is larger than $n - c$ then
11:      $\rho = 2\rho$.
12:    else if the rank of $L$ is smaller than $n - c$ then
13:      $\rho = \rho/2$.
14:    else
15:      Break.
16:    end if
17:  end while
18: end for
19: Obtain the final clusters from the $c$ connected components in $S$.

G. Time and Space Complexity

Since we need to save and handle $m$ connective matrices $S^{(1)}, \ldots, S^{(m)}$, the space complexity is $O(mn^2)$.

When computing $W$, we need to compute $A^{(i)}$ ($i = 1, \ldots, m$); thus, the time complexity is $O(mn^2)$. Computing $S$ costs $O(n^2 c + n^2 m)$ since we need to compute $C$ first. Computing $Y$ involves an eigenvector decomposition that costs $O(n^2 c)$ time. When computing $\alpha$, we need to compute $d$ in $O(n^2 m)$ time. Therefore, the whole time complexity is $O((n^2 m + n^2 c) t)$, where $t$ is the number of iterations.

The time and space complexity of our method is comparable with those of the existing connective/coassociation matrix-based methods [39]–[41]. Despite this, we plan to reduce the computational complexity of the proposed method in future work.
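For completeness, the following condensed Python sketch strings the closed-form updates (13), (17), (18), and (21) together in the structure of Algorithm 1. It reflects our reading of the article (NumPy/SciPy assumed, with small numerical guards added) and is not the authors' reference implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def laplacian(S):
    A = (S + S.T) / 2.0
    return np.diag(A.sum(axis=1)) - A

def spce(S_list, c, theta=0.5, n_inner=30):
    """Condensed sketch of Algorithm 1 (SPCE)."""
    S_all = np.stack([np.asarray(Si, dtype=float) for Si in S_list])  # m x n x n
    m = S_all.shape[0]
    S_hat = S_all.mean(axis=0)
    certain = (S_hat == 0.0) | (S_hat == 1.0)        # certain pairs (Eq. (2))
    gamma = (m * theta) ** 2                         # gamma = m^2 * theta^2
    S, alpha, rho = S_hat.copy(), np.full(m, 1.0 / m), 1.0
    for r in (0.9, 0.8, 0.7, 0.6, 0.5):              # pace schedule
        lam = 2.0 * ((r - 1.0) ** 2 * r + r ** 2 * (1.0 - r)) * m ** 2   # Eq. (23)
        B = np.einsum('ipq,i->pq', (S - S_all) ** 2, 1.0 / alpha)
        W = np.minimum(lam / np.maximum(2.0 * B, 1e-12), 1.0)           # Eq. (13)
        for _ in range(n_inner):
            # Update S by the closed form (17) on the uncertain entries.
            _, vecs = np.linalg.eigh(laplacian(S))
            Y = vecs[:, :c]                          # Eq. (18), Ky Fan's theorem
            D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # ||y_p - y_q||^2
            inv_a = 1.0 / np.maximum(alpha, 1e-12)
            C = (np.einsum('ipq,i->pq', S_all, inv_a)
                 - rho * D2 / (2.0 * np.maximum(W, 1e-12) ** 2)) / inv_a.sum()
            tau = gamma / (np.maximum(W, 1e-12) ** 2 * inv_a.sum())
            S = np.where(C >= 1.0, 1.0, np.where(C >= np.sqrt(tau), C, 0.0))
            S[certain] = S_hat[certain]              # keep the certain entries
            # Update alpha by Eq. (21).
            d = np.array([np.linalg.norm((S - Si) * W) ** 2 for Si in S_all])
            alpha = np.sqrt(d) / max(np.sqrt(d).sum(), 1e-12)
            # Adjust rho from the rank of L (Lines 10-16 of Algorithm 1).
            n_zero = int(np.sum(np.linalg.eigvalsh(laplacian(S)) < 1e-8))
            if n_zero < c:
                rho *= 2.0
            elif n_zero > c:
                rho /= 2.0
            else:
                break
    # Final clusters: connected components of S (Theorem 1).
    A = np.where((S + S.T) / 2.0 < 1e-8, 0.0, (S + S.T) / 2.0)
    return connected_components(csr_matrix(A), directed=False)[1]
```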



TABLE III ACC RESULTS ON ALL THE DATA SETS (k-MEANS BASED)

IV. EXPERIMENTS

In this section, we compare our SPCE with several state-of-the-art clustering ensemble methods on benchmark data sets.

A. Data Sets

We use a total of 12 data sets to evaluate the effectiveness of our proposed SPCE, including AR [64], Coil20 [65], GLIOMA [66], K1b [67], Lung [68], Medical [39], Tr41 [67], Tdt2 [69], TOX [66], UMIST [70], WebACE [71], and WarpAR [66]. Data sets from different areas serve as a good test bed for a comprehensive evaluation. The basic information of these data sets is summarized in Table II.

B. Compared Methods

We compare our SPCE with the following algorithms.
1) KM/SC: These are the average of all the base k-means and spectral clustering results, respectively.
2) KM-best/SC-best: These are the best result of all the base k-means and spectral clustering results, respectively.
3) KC: It represents the results of applying k-means to a consensus similarity matrix, and it is often used as a baseline in clustering ensemble methods, such as [24], [39].
4) Cluster-Based Similarity Partitioning Algorithm (CSPA) [3]: It signifies a relationship between objects in the same cluster and can, thus, be used to establish a measure of pairwise similarity.
5) Hypergraph Partitioning Algorithm (HGPA) [3]: It approximates the maximum mutual information objective with a constrained minimum cut objective.
6) Metaclustering Algorithm (MCLA) [3]: It transforms the integration into a cluster correspondence problem.
7) Nonnegative Matrix Factorization-Based Consensus Clustering (NMFC) [24]: It uses NMF to aggregate clustering results.
8) Bayesian Clustering Ensemble (BCE) [28]: It is a Bayesian model for the ensemble.
9) Robust Clustering Ensemble (RCE) [39]: It learns a robust consensus clustering result via minimizing the Kullback–Leibler divergence among the base results.
10) Multiview Ensemble Clustering (MEC) [41]: It is a robust multiview clustering ensemble method using low-rank and sparse decomposition to ensemble base clusterings and detect the noises.
11) Locally Weighted Evidence Accumulation (LWEA) [72]: It is a hierarchical agglomerative clustering ensemble method based on ensemble-driven cluster uncertainty estimation and a local weighting strategy.
12) Locally Weighted Graph Partitioning (LWGP) [72]: It is a graph partitioning method based on the local weighting strategy.
13) Robust Spectral Ensemble Clustering (RSEC) [61]: It is a robust clustering ensemble method based on spectral clustering.
14) Dense Representation Ensemble Clustering (DREC) [73]: It learns a dense representation for the clustering ensemble.
15) SPCE-W: It is our method without $W$. To evaluate the effectiveness of self-paced learning in our method, we run SPCE-W to see the performance of our method without self-paced learning, i.e., in each iteration, all weights in $W$ are fixed to 1.
16) SPCE-fixW: It is our method with a fixed $W$, where $W_{ij}$ is proportional to the frequency with which the two instances co-occur in the clusters of the given base clusterings. To further evaluate the effectiveness of self-paced learning in our method, we run SPCE-fixW with a fixed $W$, i.e., in each iteration, $W$ is fixed to the initial value introduced in Section III-E instead of being learned automatically.

C. Experimental Setup

We conduct two groups of experiments that use k-means and spectral clustering results as base clusterings, respectively. In the k-means-based clustering ensemble, following a similar experimental protocol to [28], [39], we run k-means 200 times with different initializations on all instances to obtain 200 base clustering results, which are divided evenly into ten subsets, with 20 base results in each subset. Then, we apply all clustering ensemble methods on each subset and report the average results over the ten subsets. In the spectral-based clustering ensemble, we use the Gaussian kernel $k(x_i, x_j) = e^{-\|x_i - x_j\|_2^2/(2\sigma^2)}$ to construct the Laplacian matrix for spectral clustering. $\sigma$ is the bandwidth parameter, and in our experiments, we set $\sigma = d \times \{0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100\}$ for the ensemble, i.e., we ensemble nine spectral clustering results, where $d$ is the mean distance between all pairs $(x_i, x_j)$. We also repeat this ten times and report the average results. To measure the clustering results, the widely used accuracy (ACC), NMI, and adjusted Rand index (ARI) are reported. To validate the statistical significance of the results, we also calculate the p-value of the t-test.

TABLE IV
NMI RESULTS ON ALL THE DATA SETS (k-MEANS BASED)

Fig. 1. Illustration of the clustering structure in the learned consensus matrices from different methods on the K1b data set. (a) Input coassociation matrix $\hat{S}$. (b) RCE. (c) MEC. (d) RSEC. (e) $S$ in our method (SPCE). Note that the second cluster and the third cluster are separated more clearly in our method, and so are the fourth cluster and the fifth cluster. Thus, the consensus matrix of our method is cleaner than those of the robust methods.



TABLE V ARI RESULTS ON ALL THE DATA SETS (k-MEANS BASED)

TABLE VI ACC RESULTS ON ALL THE DATA SETS (SPECTRAL CLUSTERING BASED)

The number of clusters is set to the true number of classes for all data sets and algorithms. In our method, we set the parameter $\gamma$ as $\gamma = m^2\theta^2$ (where $m$ is the number of base clustering results), as introduced in Section III-E, and tune $\theta \in \{0, 0.1, \ldots, 0.9\}$ to control the sparsity. Note that $\theta = 0$ means that we drop the regularization term $\|S\|_0$. We tune the parameters of the compared methods as suggested in their articles. All experiments are conducted using MATLAB on a PC with Windows 10, a 4.2-GHz CPU, and 32-GB memory.

TABLE VII
NMI RESULTS ON ALL THE DATA SETS (SPECTRAL CLUSTERING BASED)

D. Experimental Results

The average ACC, NMI, and ARI results and the standard deviations of the k-means-based ensemble are shown in Tables III–V, respectively. The results of the spectral-based ensemble are shown in Tables VI–VIII, respectively. Bold font indicates that the difference is statistically significant, i.e., the p-value of the t-test is smaller than 0.05. Note that since we aim to compare with other clustering ensemble methods, we do not calculate the p-value for KM, KM-best, SC, and SC-best. Due to their high space complexity, RCE and MEC yield no results on the largest data set, Tdt2, because they run out of memory. The results reveal some interesting points.

Fig. 2. Convergence curves of our method. (a) AR. (b) Coil20. (c) GLIOMA. (d) Lung.



TABLE VIII ARI RESULTS ON ALL THE DATA SETS (SPECTRAL CLUSTERING BASED)

1) Many clustering ensemble methods perform better than KM and SC, which indicates the benefit of ensemble methods. Many methods cannot outperform KM-best and SC-best most of the time. This may be because many base results are not so good, and these bad clustering results may deteriorate the performance of ensemble learning. However, the performance of our SPCE is usually close to, or even better than, the result of KM-best and SC-best. In our formulation, we minimize the Frobenius norm instead of the square of the Frobenius norm of the difference between $S$ and $S^{(i)}$. This is equivalent to adding a weight to each base clustering, which can reduce the side effect of the bad base clusterings. Moreover, the self-paced learning framework can also reduce the effect of hard (or bad) instances. Note that SPCE does not need to perform an exhaustive search on the predefined pool of base clusterings. Such results well demonstrate the superiority of our method.
2) On most data sets, our method outperforms the other compared methods significantly. Compared with the robust methods RCE, MEC, and RSEC, our method can also usually obtain better performance. This may be because our method can learn a clearer cluster structure, which is illustrated in Fig. 1. In Fig. 1, we show the input coassociation matrix and the consensus matrices learned by these robust ensemble methods on the K1b data set. We can see that the consensus matrix learned by our SPCE method is cleaner than those of the other robust methods (the second cluster and the third cluster are separated more clearly in our method, and so are the fourth cluster and the fifth cluster), which demonstrates the effectiveness of our method.
3) SPCE-fixW performs better than SPCE-W, which means that imposing the weights on instances can indeed improve the performance. Compared with SPCE-fixW, SPCE outperforms it on most data sets. This demonstrates the effectiveness of self-paced learning in our framework. Learning from easy instances and involving difficult instances gradually can further improve the performance of the clustering ensemble.

Table IX shows the running time (with 20 base clusterings for the ensemble) of the clustering ensemble methods on all data sets. Underlining means that the corresponding method is slower than ours on that data set. From Table IX, we can find that, compared with the other connective matrix-based methods, i.e., RCE, MEC, and RSEC, our method is significantly faster on most data sets. Note that on the largest data set, Tdt2, RCE and MEC run out of memory, while our method can still work.

Fig. 2 shows the algorithm convergence curves on the AR, Coil20, GLIOMA, and Lung data sets; the results on the other data sets are similar. The example results in Fig. 2 demonstrate that our method converges within a small number of iterations.

E. Parameter Study

Our method contains only one hyperparameter, $\theta$ ($0 \le \theta < 1$), which needs to be set manually. As discussed in Section III-E, $\theta$ plays the role of a threshold that controls the sparsity of the consensus matrix $S$, i.e., the larger $\theta$ is, the sparser $S$ is. We tune $\theta$ from $\{0, 0.1, \ldots, 0.9\}$ and show the results in Fig. 3. The results on the other data sets are similar. Note that $\theta = 0$ means that we drop the term $\|S\|_0$ in our formulation, and the results show that our method does not perform well without this term ($\theta = 0$), which demonstrates its necessity. From Fig. 3, we can select $\theta$ from $[0.2, 0.6]$, which can often obtain a relatively good performance.



TABLE IX RUNNING TIME ON ALL THE DATA SETS WITH 20 BASE CLUSTERINGS (s)

Fig. 3. ACC and NMI with respect to θ. (a) ACC with respect to θ on AR. (b) NMI with respect to θ on AR. (c) ACC with respect to θ on Coil20. (d) NMI with respect to θ on Coil20. (e) ACC with respect to θ on Lung. (f) NMI with respect to θ on Lung.

V. CONCLUSION

In this article, we proposed a novel SPCE method. Different from the conventional clustering ensemble methods, which use all instances in ensemble learning, we gradually involved instances in learning from easy to difficult ones. In the self-paced learning framework, we proposed an effective algorithm to jointly learn the difficulty of instances and the consensus clustering result. We conducted extensive experiments on benchmark data sets, and the experimental results demonstrated that our method not only outperformed the state-of-the-art clustering ensemble methods but also had close or even better performance compared with the best base clustering result.

In the future, we will consider the scalability issue and try to reduce the time and space complexity of our method.

REFERENCES

[1] F. Wang, X. Wang, and T. Li, "Generalized cluster aggregation," in Proc. IJCAI, 2009, pp. 1279–1284.
[2] A. Topchy, A. K. Jain, and W. Punch, "A mixture model for clustering ensembles," in Proc. SIAM Int. Conf. Data Mining, Apr. 2004, pp. 379–390.
[3] A. Strehl and J. Ghosh, "Cluster ensembles—A knowledge reuse framework for combining multiple partitions," J. Mach. Learn. Res., vol. 3, no. 3, pp. 583–617, 2003.
[4] A. Topchy, A. K. Jain, and W. Punch, "Combining multiple weak clusterings," in Proc. 3rd IEEE Int. Conf. Data Mining, 2003, pp. 331–338.
[5] Z.-H. Zhou and W. Tang, "Clusterer ensemble," Knowl.-Based Syst., vol. 19, no. 1, pp. 77–83, Mar. 2006.
[6] F.-J. Li, Y.-H. Qian, J.-T. Wang, and J.-Y. Liang, "Multigranulation information fusion: A Dempster-Shafer evidence theory based clustering ensemble method," in Proc. Int. Conf. Mach. Learn. Cybern. (ICMLC), vol. 1, Jul. 2015, pp. 58–63.
[7] H. Liu, M. Shao, S. Li, and Y. Fu, "Infinite ensemble clustering," Data Mining Knowl. Discovery, vol. 32, no. 2, pp. 385–416, Mar. 2018.
[8] X. Z. Fern and C. E. Brodley, "Solving cluster ensemble problems by bipartite graph partitioning," in Proc. 21st Int. Conf. Mach. Learn. (ICML), 2004, p. 36.
[9] Y. Ren, C. Domeniconi, G. Zhang, and G. Yu, "Weighted-object ensemble clustering: Methods and analysis," Knowl. Inf. Syst., vol. 51, no. 2, pp. 661–689, May 2017.
[10] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. ICML, 2009, pp. 41–48.
[11] T. G. Dietterich et al., "Ensemble learning," in The Handbook of Brain Theory and Neural Networks, vol. 2. Cambridge, MA, USA: MIT Press, 2002, pp. 110–125.
[12] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 10, pp. 993–1001, Oct. 1990.
[13] H. Yu, J. Wang, Y. Bai, W. Yang, and G.-S. Xia, "Analysis of large-scale UAV images using a multi-scale hierarchical representation," Geo-Spatial Inf. Sci., vol. 21, no. 1, pp. 33–44, Jan. 2018.
[14] X. Deng, J. Chen, H. Li, P. Han, and W. Yang, "Log-cumulants of the finite mixture model and their application to statistical analysis of fully polarimetric UAVSAR data," Geo-Spatial Inf. Sci., vol. 21, no. 1, pp. 45–55, Jan. 2018.
[15] Z.-H. Zhou, Y. Jiang, Y.-B. Yang, and S.-F. Chen, "Lung cancer cell identification based on artificial neural network ensembles," Artif. Intell. Med., vol. 24, no. 1, pp. 25–36, Jan. 2002.
[16] P. Zhou, Y.-D. Shen, L. Du, F. Ye, and X. Li, "Incremental multi-view spectral clustering," Knowl.-Based Syst., vol. 174, pp. 73–86, Jun. 2019.
[17] P. Zhou, Y.-D. Shen, L. Du, and F. Ye, "Incremental multi-view support vector machine," in Proc. SIAM Int. Conf. Data Mining. Philadelphia, PA, USA: SIAM, 2019, pp. 1–9.
[18] Z. Tao, H. Liu, S. Li, Z. Ding, and Y. Fu, "Marginalized multiview ensemble clustering," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 2, pp. 600–611, Feb. 2020.



[19] S. Wang et al., "Multi-view clustering via late fusion alignment maximization," in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 3778–3784.
[20] S. Wang et al., "Efficient multiple kernel k-means clustering with late fusion," IEEE Access, vol. 7, pp. 61109–61120, 2019.
[21] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug. 1997.
[22] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Ann. Statist., vol. 29, pp. 1189–1232, Oct. 2001.
[23] T. Li, C. Ding, and M. I. Jordan, "Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization," in Proc. 7th IEEE Int. Conf. Data Mining (ICDM), Oct. 2007, pp. 577–582.
[24] T. Li and C. Ding, "Weighted consensus clustering," in Proc. SIAM Int. Conf. Data Mining, Apr. 2008, pp. 798–809.
[25] H. Liu, T. Liu, J. Wu, D. Tao, and Y. Fu, "Spectral ensemble clustering," in Proc. SIGKDD, 2015, pp. 715–724.
[26] Z. Tao, H. Liu, and Y. Fu, "Simultaneous clustering and ensemble," in Proc. AAAI, 2017, pp. 1546–1552.
[27] D. Huang, C.-D. Wang, J. Wu, J.-H. Lai, and C. K. Kwoh, "Ultra-scalable spectral clustering and ensemble clustering," IEEE Trans. Knowl. Data Eng., early access, Mar. 6, 2019, doi: 10.1109/TKDE.2019.2903410.
[28] H. Wang, H. Shan, and A. Banerjee, "Bayesian cluster ensembles," in Proc. SDM, 2009, pp. 211–222.
[29] D. Huang, J. Lai, and C.-D. Wang, "Ensemble clustering using factor graph," Pattern Recognit., vol. 50, pp. 131–142, Feb. 2016.
[30] S.-O. Abbasi, S. Nejatian, H. Parvin, V. Rezaie, and K. Bagherifard, "Clustering ensemble selection considering quality and diversity," Artif. Intell. Rev., vol. 52, no. 2, pp. 1311–1340, 2019.
[31] A. Bagherinia, B. Minaei-Bidgoli, M. Hossinzadeh, and H. Parvin, "Elite fuzzy clustering ensemble based on clustering diversity and quality measures," Appl. Intell., vol. 49, no. 5, pp. 1724–1747, 2019.
[32] J. Azimi and X. Fern, "Adaptive cluster ensemble selection," in Proc. 21st Int. Joint Conf. Artif. Intell., 2009, pp. 992–997.
[33] Y. Hong, S. Kwong, H. Wang, and Q. Ren, "Resampling-based selective clustering ensembles," Pattern Recognit. Lett., vol. 30, no. 3, pp. 298–305, Feb. 2009.
[34] H. Parvin and B. Minaei-Bidgoli, "A clustering ensemble framework based on elite selection of weighted clusters," Adv. Data Anal. Classification, vol. 7, no. 2, pp. 181–208, Jun. 2013.
[35] H. Parvin and B. Minaei-Bidgoli, "A clustering ensemble framework based on selection of fuzzy weighted clusters in a locally adaptive clustering algorithm," Pattern Anal. Appl., vol. 18, no. 1, pp. 87–112, Feb. 2015.
[36] Z. Yu et al., "Hybrid clustering solution selection strategy," Pattern Recognit., vol. 47, no. 10, pp. 3362–3375, Oct. 2014.
[37] X. Zhao, J. Liang, and C. Dang, "Clustering ensemble selection for categorical data based on internal validity indices," Pattern Recognit., vol. 69, pp. 150–168, Sep. 2017.
[38] Y. Shi et al., "Transfer clustering ensemble selection," IEEE Trans. Cybern., early access, Dec. 25, 2018, doi: 10.1109/TCYB.2018.2885585.
[39] P. Zhou, L. Du, H. Wang, L. Shi, and Y. Shen, "Learning a robust consensus matrix for clustering ensemble via Kullback-Leibler divergence minimization," in Proc. IJCAI, 2015, pp. 4112–4118.
[40] Z. Tao, H. Liu, S. Li, and Y. Fu, "Robust spectral ensemble clustering," in Proc. 25th ACM Int. Conf. Inf. Knowl. Manage. (CIKM), 2016, pp. 367–376.
[41] Z. Tao, H. Liu, S. Li, Z. Ding, and Y. Fu, "From ensemble clustering to multi-view clustering," in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 2843–2849.
[42] D. Huang, C.-D. Wang, H. Peng, J. Lai, and C.-K. Kwoh, "Enhanced ensemble clustering via fast propagation of cluster-wise similarities," IEEE Trans. Syst., Man, Cybern. Syst., early access, Nov. 5, 2018, doi: 10.1109/TSMC.2018.2876202.
[43] E. A. Ni and C. X. Ling, "Supervised learning with minimal effort," in Proc. Pacific–Asia Conf. Knowl. Discovery Data Mining. Hyderabad, India: Springer, 2010, pp. 476–487.
[44] S. Basu and J. Christensen, "Teaching classification boundaries to humans," in Proc. AAAI, 2013, pp. 109–115.
[45] M. P. Kumar, B. Packer, and D. Koller, "Self-paced learning for latent variable models," in Proc. NIPS, 2010, pp. 1189–1197.
[46] L. Jiang, D. Meng, S. Yu, Z. Lan, S. Shan, and A. G. Hauptmann, "Self-paced learning with diversity," in Proc. NIPS, 2014, pp. 2078–2086.
[47] D. Zhang, D. Meng, and J. Han, "Co-saliency detection via a self-paced multiple-instance learning framework," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 5, pp. 865–878, May 2017.
[48] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, "Self-paced curriculum learning," in Proc. AAAI, 2015, pp. 2694–2700.
[49] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann, "Self-paced learning for matrix factorization," in Proc. AAAI, 2015, pp. 3196–3202.
[50] D. Meng, Q. Zhao, and L. Jiang, "A theoretical understanding of self-paced learning," Inf. Sci., vol. 414, pp. 319–328, Nov. 2017.
[51] C. Li, J. Yan, F. Wei, W. Dong, Q. Liu, and H. Zha, "Self-paced multi-task learning," in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 2175–2181.
[52] Y. Ren, X. Que, D. Yao, and Z. Xu, "Self-paced multi-task clustering," Neurocomputing, vol. 350, pp. 212–220, Jul. 2019.
[53] Y. Ren, P. Zhao, Y. Sheng, D. Yao, and Z. Xu, "Robust softmax regression for multi-class classification with self-paced learning," in Proc. 26th Int. Joint Conf. Artif. Intell. Stanford, CA, USA: AAAI Press, Aug. 2017, pp. 2641–2647.
[54] Z. Kang, X. Lu, J. Yi, and Z. Xu, "Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification," in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 1881–1887.
[55] F. Nie, L. Tian, and X. Li, "Multiview clustering via adaptively weighted procrustes," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2018, pp. 2022–2030.
[56] F. Nie, X. Wang, and H. Huang, "Clustering and projected clustering with adaptive neighbors," in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2014, pp. 977–986.
[57] K. Fan, "On a theorem of Weyl concerning eigenvalues of linear transformations. II," Proc. Nat. Acad. Sci. USA, vol. 36, no. 1, pp. 31–35, 1949.
[58] P. Zhou, L. Du, L. Shi, H. Wang, and Y.-D. Shen, "Recovery of corrupted multiple kernels for clustering," in Proc. 24th Int. Joint Conf. Artif. Intell., 2015, pp. 4105–4111.
[59] X. Liu et al., "Optimal neighborhood kernel clustering with multiple kernels," in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 2266–2272.
[60] Y. Wang, X. Liu, Y. Dou, and R. Li, "Multiple kernel clustering framework with improved kernels," in Proc. 26th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 2999–3005.
[61] Z. Tao, H. Liu, S. Li, Z. Ding, and Y. Fu, "Robust spectral ensemble clustering via rank minimization," ACM Trans. Knowl. Discovery from Data, vol. 13, no. 1, pp. 1–25, Jan. 2019.
[62] D. Huang, J.-H. Lai, and C.-D. Wang, "Robust ensemble clustering using probability trajectories," IEEE Trans. Knowl. Data Eng., vol. 28, no. 5, pp. 1312–1326, May 2016.
[63] X. Liu et al., "Late fusion incomplete multi-view clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 10, pp. 2410–2423, Oct. 2019.
[64] H. Wang, F. Nie, and H. Huang, "Globally and locally consistent unsupervised projection," in Proc. 28th AAAI Conf. Artif. Intell., 2014, pp. 1328–1333.
[65] D. Cai, X. He, J. Han, and T. S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.
[66] J. Li et al., "Feature selection: A data perspective," ACM Comput. Surv., vol. 50, no. 6, p. 94, 2018.
[67] Y. Zhao and G. Karypis, "Empirical and theoretical comparisons of selected criterion functions for document clustering," Mach. Learn., vol. 55, no. 3, pp. 311–331, Jun. 2004.
[68] Z.-Q. Hong and J.-Y. Yang, "Optimal discriminant plane for a small number of samples and design method of classifier on the plane," Pattern Recognit., vol. 24, no. 4, pp. 317–324, Jan. 1991.
[69] D. Cai, X. He, W. V. Zhang, and J. Han, "Regularized locality preserving indexing via spectral regression," in Proc. 16th ACM Conf. Inf. Knowl. Manage. (CIKM), 2007, pp. 741–750.
[70] H. Wechsler, J. P. Phillips, V. Bruce, F. F. Soulié, and T. S. Huang, Face Recognition: From Theory to Applications, vol. 163. Springer, 2012.
[71] L. Du, X. Li, and Y.-D. Shen, "Cluster ensembles via weighted graph regularized nonnegative matrix factorization," in Proc. Int. Conf. Adv. Data Mining Appl. Beijing, China: Springer, 2011, pp. 215–228.
[72] D. Huang, C.-D. Wang, and J.-H. Lai, "Locally weighted ensemble clustering," IEEE Trans. Cybern., vol. 48, no. 5, pp. 1460–1473, May 2018.
[73] J. Zhou, H. Zheng, and L. Pan, "Ensemble clustering based on dense representation," Neurocomputing, vol. 357, pp. 66–76, Sep. 2019.



Peng Zhou received the B.E. degree in computer science and technology from the University of Science and Technology of China, Hefei, China, in 2011, and the Ph.D. degree in computer science from the Institute of Software, University of Chinese Academy of Sciences, Beijing, China, in 2017. He is currently a Lecturer with the School of Computer Science and Technology, Anhui University, Hefei. His research interests include machine learning, data mining, and artificial intelligence.

Yi-Dong Shen was with Chongqing University, Chongqing, China. He is currently a Professor of computer science with the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China. His main research interests include knowledge representation and reasoning, semantic web, and data mining.

Liang Du (Member, IEEE) received the B.E. degree in software engineering from Wuhan University, Wuhan, China, in 2007, and the Ph.D. degree in computer science from the Institute of Software, University of Chinese Academy of Sciences, Beijing, China, in 2013. He was a Software Engineer with Alibaba Group, China, from July 2013 to July 2014. He is currently a Lecturer with Shanxi University, Taiyuan, China. His research interests include machine learning, data mining, and big data analysis.

Xinwang Liu received the Ph.D. degree from the National University of Defense Technology (NUDT), Changsha, China. He is currently a Professor with the School of Computer, NUDT. He has published more than 60 peer-reviewed articles, including articles in highly regarded journals and conferences, such as the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (T-PAMI), the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (T-KDE), the IEEE TRANSACTIONS ON IMAGE PROCESSING (T-IP), the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (T-NNLS), the IEEE TRANSACTIONS ON MULTIMEDIA (T-MM), the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY (T-IFS), NeurIPS, the International Conference on Computer Vision (ICCV), the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), the AAAI Conference on Artificial Intelligence (AAAI), and the International Joint Conference on Artificial Intelligence (IJCAI). His current research interests include kernel learning and unsupervised feature learning.

Mingyu Fan received the B.Sc. degree in information and computing science from the Minzu University of China, Beijing, China, in 2006, and the Ph.D. degree in applied mathematics from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, in 2011. He is currently a Professor with Wenzhou University, Wenzhou, China. His current research interests include data mining and pattern recognition algorithms and their applications to solving industrial problems.

Xuejun Li received the Ph.D. degree from Anhui University, Hefei, China, in 2008. He is currently a Professor with the School of Computer Science and Technology, Anhui University. His major research interests include workflow systems, cloud computing, and intelligent software.
