DISCRIMINATIVELY CONSTRAINED SEMI-SUPERVISED MULTI-VIEW NONNEGATIVE MATRIX FACTORIZATION WITH GRAPH REGULARIZATION

Guosheng Cui, Ruxin Wang, Dan Wu, and Ye Li†

Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.

† Corresponding author: [email protected]

ABSTRACT

In recent years, semi-supervised multi-view nonnegative matrix factorization (MVNMF) algorithms have achieved promising performance for multi-view clustering. However, most semi-supervised MVNMFs fail to effectively consider discriminative information among clusters and feature alignment across multiple views simultaneously. In this paper, a novel Discriminatively Constrained Semi-Supervised Multi-View Nonnegative Matrix Factorization (DCS2MVNMF) is proposed. Specifically, a discriminative weighting matrix is introduced for the auxiliary matrix of each view, which enhances the inter-class distinction. Meanwhile, a new graph regularization is constructed with the label and geometrical information. In addition, we design a new feature scale normalization strategy to align the multiple views, and derive the corresponding iterative optimization schemes. Extensive experiments conducted on several real-world multi-view datasets demonstrate the effectiveness of the proposed method.

Index Terms— Multi-view, semi-supervised clustering, discriminative information, geometry information, feature normalizing strategy

1. INTRODUCTION

With the rapid development of information technology, data comes not only with high dimensionality but also with multiple modalities or views. These different views generally contain complementary and interacting information, which can be aggregated to boost the performance of specific tasks in real-world applications. Consequently, the key issue is how to effectively integrate the information coming from multiple views to generate a powerful representation in the learning procedure [1, 2].

In recent years, multi-view clustering has attracted significant attention and application. Many nonnegative matrix factorization (NMF) based approaches for multi-view clustering have been proposed and have achieved breakthroughs [3–5]. In the multi-view nonnegative matrix factorization (MVNMF) framework, which integrates the information of multiple views into one compact representation, the key problem is to design an efficient fusion strategy. MultiNMF [6] tries to learn a consensus representation with a centroid co-regularization term. In [7], an MVNMF approach based on pair-wise co-regularization is proposed; with pair-wise co-regularization, the features from different views are pushed together and alignment is acquired. In [8], an MVNMF method based on nonnegative matrix tri-factorization [9] is designed. This method factorizes each view into three matrices and constrains the columns of the product of the basis matrix and the shared embedding matrix to be unit vectors; a centroid co-regularization is used to align the multiple views. The above-mentioned methods all fuse multiple views in an unsupervised manner.

Beyond that, some semi-supervised MVNMFs have been proposed and obtain promising results with label information [10–12]. Wang et al. presented the AMVNMF method, which extended constrained nonnegative matrix factorization (CNMF) [13] to the multi-view clustering task for the first time [14]. Specifically, an auto-weighting strategy is adopted to balance the different views, and a centroid co-regularization is used to align the multiple views. In [15], the proposed MVCNMF approach integrated a sparseness constraint and aligned multiple views with a Euclidean distance based pair-wise co-regularization like [7]. Overall, how to align multiple views and enhance the discrimination of inter-class features plays an important role in feature learning.

Addressing these issues, in this paper we present a novel discriminatively constrained semi-supervised MVNMF (DCS2MVNMF), which enlarges the inter-class diversities and shrinks the intra-class variation simultaneously. In this work, a discriminative weighting matrix is designed and imposed on the auxiliary matrix to enforce the inter-class distinctness. A new graph regularization is incorporated into the objective function, which exploits the geometrical information of the multi-view data in each view to decrease the intra-class diversities. To align multiple views effectively, we restrict the column vectors of the basis matrix in each view to be unit vectors, and a centroid co-regularization is adopted. An iterative optimizing scheme is also designed for the proposed method.

2. METHODOLOGY

2.1. Reconstruction Loss

In each view, we apply a CNMF-like reconstruction following [11, 14, 15]. Additionally, to align the multiple views, a centroid co-regularization is imposed on Z^v. Therefore, the reconstruction term of our DCS2MVNMF can be written as follows:

$$\sum_{v=1}^{n_v} \Big( \|X^v - W^v (A_{lc} Z^v)^T\|_F^2 + \gamma \|Z^v - Z_c\|_F^2 \Big) \qquad (1)$$

[Fig. 1. X^v, Z^v and A_{lc}Z^v of (a) AMVNMF [14], MVCNMF [15] and MVOCNMF [11]; (b) DCS2MVNMF. The samples with the same labels, such as x_1 and x_2, are mapped into the same vectors, i.e., h_1^v = h_2^v = z_1^v.]
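As a concrete reading of Eq. (1), the following minimal NumPy sketch evaluates the reconstruction term plus the centroid co-regularization. This is our own illustration rather than the authors' code; the function and argument names are hypothetical, and A_lc is the label constraint matrix defined in Eq. (2) below.

```python
import numpy as np

def reconstruction_loss(Xs, Ws, Zs, Zc, A_lc, gamma):
    """Evaluate Eq. (1): per-view CNMF-style reconstruction error
    plus the centroid co-regularization ||Z^v - Z_c||_F^2.

    Xs   : list of (m_v, n) data matrices, one per view
    Ws   : list of (m_v, d) nonnegative basis matrices
    Zs   : list of (n - l + c, d) nonnegative auxiliary matrices
    Zc   : (n - l + c, d) consensus auxiliary matrix
    A_lc : (n, n - l + c) label constraint matrix of Eq. (2)
    """
    loss = 0.0
    for Xv, Wv, Zv in zip(Xs, Ws, Zs):
        Hv = A_lc @ Zv                      # coefficient matrix H^v = A_lc Z^v
        loss += np.linalg.norm(Xv - Wv @ Hv.T, 'fro') ** 2
        loss += gamma * np.linalg.norm(Zv - Zc, 'fro') ** 2
    return loss
```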

where Z^v ∈ R^{(n−l+c)×d_v} is the auxiliary matrix of the vth view and Z_c denotes the consensus auxiliary matrix. n, l and c are the number of data points, the number of labeled data points and the number of classes of the dataset, respectively, and d_v is the dimension of the vth view. A_lc is a label constraint matrix defined as follows:

$$A_{lc} = \begin{bmatrix} C_{l \times c} & 0 \\ 0 & I_{n-l} \end{bmatrix} \qquad (2)$$

where I_{n−l} is an identity matrix of dimension n−l, and C is a label index matrix in which each row is a one-hot vector whose 1 corresponds to the true category.

2.2. Discriminative Prior

When each view is factorized only in the CNMF framework, samples with the same label are mapped to the same feature vectors; this fails to exploit the discriminative information provided by the labeled samples, and the inter-class distinctness of the learned representations is not guaranteed (see Figure 1). To endow DCS2MVNMF with discriminative ability, we impose the following constraint on the auxiliary matrix Z^v:

$$\sum_{v=1}^{n_v} \|I_{disc} \odot Z^v\|_F^2 \qquad (3)$$

where ⊙ is the element-wise product operator and I_disc = [Î; 0] ∈ R^{(n−l+c)×d_v} is a discriminative matrix that enforces the distinction of the features from different classes. 0 ∈ R^{(n−l)×d_v} is an all-zero matrix corresponding to the unlabeled samples. Specifically, Î ∈ R^{c×d_v} is defined as follows:

$$\hat{I} = \begin{bmatrix} \bar{0} & \bar{1} & \cdots & \bar{1} \\ \bar{1} & \bar{0} & \cdots & \bar{1} \\ \vdots & \vdots & \ddots & \vdots \\ \bar{1} & \bar{1} & \cdots & \bar{0} \end{bmatrix} \qquad (4)$$

where 0̄ = [0, ..., 0] and 1̄ = [1, ..., 1] are row vectors of length m_s, and m_s is the dimensionality of each subspace. In this paper, we set m_s = 1 as in most of the literature. By minimizing Eq. (3), the entries off the (block) diagonal are suppressed, so the inter-class distinction is ensured (see Figure 1(b)).
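For illustration, the constraint matrices of Eqs. (2) and (4) could be assembled as below. This is a hedged sketch under our own naming; it assumes the labeled samples come first, with integer labels in {0, ..., c-1}, and that d_v = c·m_s so the diagonal blocks of Î line up.

```python
import numpy as np

def build_constraint_matrices(labels, n, c, d_v, m_s=1):
    """Assemble A_lc of Eq. (2) and I_disc = [I_hat; 0] of Eqs. (3)-(4).

    labels : length-l integer array with the classes of the labeled samples
    n, c   : number of samples and number of classes
    d_v    : latent dimension of the view (assumed d_v = c * m_s here)
    """
    l = len(labels)
    # Eq. (2): block matrix [[C, 0], [0, I_{n-l}]]
    C = np.zeros((l, c))
    C[np.arange(l), labels] = 1.0          # one-hot label index matrix
    A_lc = np.zeros((n, n - l + c))
    A_lc[:l, :c] = C
    A_lc[l:, c:] = np.eye(n - l)
    # Eq. (4): ones everywhere except the m_s-wide diagonal blocks
    I_hat = np.ones((c, d_v))
    for k in range(c):
        I_hat[k, k * m_s:(k + 1) * m_s] = 0.0
    I_disc = np.vstack([I_hat, np.zeros((n - l, d_v))])
    return A_lc, I_disc
```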

2.3. Geometrical Prior

To further utilize the geometrical information of the multi-view data, a new graph regularization using both the labeled and unlabeled samples is defined as follows:

$$\sum_{v=1}^{n_v} \sum_{i=1}^{n} \sum_{j=1}^{n} \|h_i^v - h_j^v\|_2^2 \, S_{ij}^v = \sum_{v=1}^{n_v} \mathrm{tr}(H^{vT} L^v H^v) = \sum_{v=1}^{n_v} \mathrm{tr}\big((A_{lc} Z^v)^T L^v A_{lc} Z^v\big) \qquad (5)$$

where H^v = A_lc Z^v, L^v = D^v − S^v, and D^v_{ii} = Σ_j S^v_{ij} (or D^v_{ii} = Σ_j S^v_{ji}). S^v_{ij} = e^{−||h_i^v − h_j^v||_2^2 / 2δ^2} if h_i^v is one of the k-nearest neighbors of h_j^v or h_j^v is one of the k-nearest neighbors of h_i^v, and δ is a predefined parameter.

2.4. Feature Normalization

To align multiple features effectively, the scales of the Z^v's must be restricted to be comparable. To this end, we restrict the column vectors of W^v so that ||W^v_{·j}||_2 = 1 [16]. However, directly optimizing the objective function under this constraint makes the optimization problem very complex. To tackle this problem, an alternative scheme is to compensate the norms of the basis matrix into the coefficient matrix. The objective can then be rewritten as:

$$\min_{W^v, Z^v, Z_c \ge 0,\ \|W_{\cdot j}^v\|_2 = 1} \sum_{v=1}^{n_v} \Big( \|X^v - W^v (A_{lc} Z^v)^T\|_F^2 + \alpha \|I_{disc} \odot Z^v Q^v\|_F^2 + \beta\, \mathrm{tr}\big(Q^{vT} (A_{lc} Z^v)^T L^v A_{lc} Z^v Q^v\big) + \gamma \|Z^v Q^v - Z_c\|_F^2 \Big) \qquad (6)$$

where Q^v is defined as

$$Q^v = \mathrm{Diag}\Bigg( \sqrt{\sum_{i=1}^{m_v} (W_{i,1}^v)^2},\ \sqrt{\sum_{i=1}^{m_v} (W_{i,2}^v)^2},\ \ldots,\ \sqrt{\sum_{i=1}^{m_v} (W_{i,d_v}^v)^2} \Bigg) \qquad (7)$$
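The quantities of Eqs. (5) and (7) admit short helpers such as the following. This is again an illustrative sketch, not the paper's code; `knn_heat_kernel_graph` and `scale_matrix_q` are our own names, and the symmetric-OR neighborhood rule follows the description above.

```python
import numpy as np

def knn_heat_kernel_graph(H, k, delta):
    """S^v and L^v of Eq. (5), built from H^v = A_lc Z^v (rows = samples).

    S_ij = exp(-||h_i - h_j||^2 / (2 delta^2)) when i is among the k nearest
    neighbors of j, or j is among the k nearest neighbors of i; else 0.
    """
    n = H.shape[0]
    sq = np.sum(H ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * H @ H.T, 0.0)
    nn = np.argsort(dist2, axis=1)[:, 1:k + 1]   # k nearest neighbors, self excluded
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), nn.ravel()] = True
    mask |= mask.T                               # symmetric OR rule
    S = np.where(mask, np.exp(-dist2 / (2.0 * delta ** 2)), 0.0)
    D = np.diag(S.sum(axis=1))                   # degree matrix, D_ii = sum_j S_ij
    return S, D - S                              # affinity S^v and Laplacian L^v = D^v - S^v

def scale_matrix_q(W):
    """Q^v of Eq. (7): diagonal matrix of the column norms of W^v."""
    return np.diag(np.sqrt((W ** 2).sum(axis=0)))
```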

3. OPTIMIZATION

For a specific view v, the optimization with respect to W^v and Z^v does not depend on the other views. The problem of minimizing Eq. (6) can therefore be written as follows:

$$\min_{W^v, Z^v, Z_c \ge 0} \|X^v - W^v (A_{lc} Z^v)^T\|_F^2 + \alpha \|I_{disc} \odot Z^v Q^v\|_F^2 + \beta\, \mathrm{tr}\big(Q^{vT} (A_{lc} Z^v)^T L^v A_{lc} Z^v Q^v\big) + \gamma \|Z^v Q^v - Z_c\|_F^2 \qquad (8)$$

Fixing Z_c, updating W^v and Z^v

1) Fixing Z_c and Z^v, updating W^v. When Z_c and Z^v are fixed, the above problem is equivalent to the following one:

$$\min_{W^v \ge 0} \|X^v - W^v (A_{lc} Z^v)^T\|_F^2 + \alpha\, \mathrm{tr}\big(Q^{vT} (I_{disc} \odot Z^v)^T (I_{disc} \odot Z^v) Q^v\big) + \beta\, \mathrm{tr}\big(Q^{vT} (A_{lc} Z^v)^T L^v A_{lc} Z^v Q^v\big) + \gamma\, \mathrm{tr}\big(Q^{vT} Z^{vT} Z^v Q^v - 2 Z_c^T Z^v Q^v\big) \qquad (9)$$

Let

$$Y_1^v = [(I_{disc} \odot Z^v)^T (I_{disc} \odot Z^v)] \odot I, \quad Y_2^v = [(A_{lc} Z^v)^T L^v A_{lc} Z^v] \odot I = Y_2^{v+} - Y_2^{v-},$$
$$Y_3^v = [(Z^v)^T Z^v] \odot I, \quad Y_4^v = (Z_c^T Z^v) \odot I \qquad (10)$$

where I is the identity matrix and the matrices Y_2^{v+} and Y_2^{v-} are defined as

$$Y_2^{v+} = [(A_{lc} Z^v)^T D^v A_{lc} Z^v] \odot I, \qquad Y_2^{v-} = [(A_{lc} Z^v)^T S^v A_{lc} Z^v] \odot I \qquad (11)$$

Using the Lagrange multiplier method, we obtain the following update rule:

$$W_{ih}^v = W_{ih}^v \, \frac{\big(X^v A_{lc} Z^v + \beta W^v Y_2^{v-} + \gamma W^v (Q^v)^{-1} Y_4^v\big)_{ih}}{\big(W^v (A_{lc} Z^v)^T A_{lc} Z^v + \alpha W^v Y_1^v + \beta W^v Y_2^{v+} + \gamma W^v Y_3^v\big)_{ih}} \qquad (12)$$

2) Fixing Z_c and W^v, updating Z^v. After W^v is updated, the column vectors of W^v are normalized with Q^v in Eq. (7), and the norm is conveyed to the coefficient matrix Z^v, that is:

$$W^v \Leftarrow W^v (Q^v)^{-1}, \qquad Z^v \Leftarrow Z^v Q^v \qquad (13)$$

When Z_c and W^v are fixed, Eq. (8) is equivalent to the following problem:

$$\min_{Z^v \ge 0} \|X^v - W^v (A_{lc} Z^v)^T\|_F^2 + \alpha\, \mathrm{tr}\big((I_{disc} \odot Z^v)^T (I_{disc} \odot Z^v)\big) + \beta\, \mathrm{tr}\big((A_{lc} Z^v)^T L^v A_{lc} Z^v\big) + \gamma\, \mathrm{tr}\big(Z^{vT} Z^v - 2 Z_c^T Z^v\big) \qquad (14)$$

Similarly, the update rule for Z_{jh}^v can be written as follows:

$$Z_{jh}^v = Z_{jh}^v \, \frac{\big(A_{lc}^T X^{vT} W^v + \beta A_{lc}^T S^v A_{lc} Z^v + \gamma Z_c\big)_{jh}}{\big(A_{lc}^T A_{lc} Z^v W^{vT} W^v + \alpha\, I_{disc} \odot Z^v + \beta A_{lc}^T D^v A_{lc} Z^v + \gamma Z^v\big)_{jh}} \qquad (15)$$

Fixing W^v and Z^v, updating Z_c. Correspondingly, the exact solution for Z_c is $Z_c = \frac{1}{n_v} \sum_{v=1}^{n_v} Z^v \ge 0$.
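Putting Eqs. (12), (13) and (15) together, one outer iteration over a single view might look as follows. This is a sketch of the multiplicative updates under our own naming, with a small `eps` added to the denominators for numerical safety; the authors' actual implementation may differ.

```python
import numpy as np

def update_view(Xv, Wv, Zv, Zc, A_lc, S, D, I_disc,
                alpha, beta, gamma, eps=1e-10):
    """One pass of the updates (12), (13) and (15) for a single view."""
    d = Wv.shape[1]
    I_d = np.eye(d)
    Q = np.diag(np.sqrt((Wv ** 2).sum(axis=0)))   # Eq. (7)
    AZ = A_lc @ Zv
    M = I_disc * Zv                               # I_disc ⊙ Z^v
    Y1 = (M.T @ M) * I_d                          # Eq. (10)
    Y2p = (AZ.T @ D @ AZ) * I_d                   # Eq. (11)
    Y2m = (AZ.T @ S @ AZ) * I_d
    Y3 = (Zv.T @ Zv) * I_d
    Y4 = (Zc.T @ Zv) * I_d
    # Eq. (12): multiplicative update of W^v
    num = Xv @ AZ + beta * Wv @ Y2m + gamma * Wv @ np.linalg.inv(Q) @ Y4
    den = Wv @ AZ.T @ AZ + alpha * Wv @ Y1 + beta * Wv @ Y2p + gamma * Wv @ Y3
    Wv = Wv * num / (den + eps)
    # Eq. (13): normalize W^v and convey the column norms to Z^v
    Q = np.diag(np.sqrt((Wv ** 2).sum(axis=0)))
    Wv = Wv @ np.linalg.inv(Q)
    Zv = Zv @ Q
    # Eq. (15): multiplicative update of Z^v
    AZ = A_lc @ Zv
    num = A_lc.T @ Xv.T @ Wv + beta * A_lc.T @ S @ AZ + gamma * Zc
    den = (A_lc.T @ AZ @ Wv.T @ Wv + alpha * I_disc * Zv
           + beta * A_lc.T @ D @ AZ + gamma * Zv)
    Zv = Zv * num / (den + eps)
    return Wv, Zv

# Across views, Z_c is then refreshed as the average: Zc = sum(Zs) / len(Zs)
```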

4. EXPERIMENTS

4.1. Materials

Yale [17]: This dataset contains 165 gray-scale face images collected from 15 people. ORL [17]: This dataset has 400 gray-scale face images collected from 40 people. For these two datasets, the image intensity, Gabor and LBP features are used as three views. ECG¹: This dataset has 162 original ECG records, including 96 arrhythmia, 30 congestive heart failure, and 36 normal sinus rhythm records. Each 20-second segment is treated as a data sample; in total, 294 data samples are used for evaluation. The frequency-domain and time-domain features are employed as two views. WebKB [18]: This dataset is a subset of web documents from four universities, containing 1051 pages with 2 classes: 230 Course pages and 821 Non-Course pages. Each page has 2 views: a Fulltext view and an Inlinks view. The dataset is balanced by selecting the first 241 data points of the second class.

¹ https://www.physionet.org/

4.2. Comparative Algorithms

To evaluate the effectiveness of the proposed method, several state-of-the-art NMF-based multi-view clustering methods are selected, including unsupervised methods (VAGNMF, VCGNMF, LP-DiNMF [3], rNNMF [19], MPMNMF [7] and UDNMF [8]) and three semi-supervised approaches (AMVNMF [14], MVCNMF [15] and MVOCNMF [11]). Specifically, VAGNMF and VCGNMF run GNMF [20] on each view individually, and use the average feature and the concatenated feature of the multiple views as the final representation, respectively.

To avoid errors from randomness, all methods are tested 10 times and the average value is reported. For the semi-supervised methods, 10%, 20% and 30% of the data points are randomly labeled 5 times, and the average result over the 5 runs is reported. The default parameter settings are given in Table 1. Two widely used evaluation metrics, accuracy (AC) and normalized mutual information (NMI), are adopted to measure the clustering performance.

Table 1. Default parameter setting of DCS2MVNMF. The values in brackets correspond to 10%, 20% and 30% labeled data points.

      Yale                ORL                 ECG                 WebKB
α     (10^3, 10^4, 10^2)  (10^2, 10^3, 10^3)  (10^5, 10^4, 10^3)  (10^3, 10^3, 10^3)
β     (0.1, 0.1, 1)       (1, 1, 0.1)         (10, 1, 10)         (1, 1, 1)
γ     (0.1, 0.1, 0.1)     (0.01, 0.01, 0.01)  (0.1, 0.1, 0.1)     (1, 1, 1)
k     (2, 2, 2)           (2, 2, 2)           (4, 4, 4)           (3, 3, 3)
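The paper does not spell out its AC/NMI implementation; a standard choice, shown here as an assumption, is Hungarian matching between cluster ids and class labels for AC (via SciPy) together with scikit-learn's NMI:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """AC: best one-to-one match between cluster ids and class labels."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for i, ci in enumerate(clusters):
        for j, cj in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == ci) & (y_true == cj))
    row, col = linear_sum_assignment(cost)   # Hungarian algorithm
    return -cost[row, col].sum() / len(y_true)

def nmi(y_true, y_pred):
    return normalized_mutual_info_score(y_true, y_pred)
```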

4.3. Comparison Results

Table 2 shows the comparison results of some recently proposed unsupervised MVNMFs and our DCS2MVNMF with 10% labels on the four datasets. From this table, the clustering performance is effectively improved by utilizing the label information. Table 3 lists the analogous comparison with three CNMF-based semi-supervised MVNMFs under different labeling ratios. It demonstrates that the discriminative information is effectively explored in our DCS2MVNMF, and that introducing the discriminative prior and the graph regularization helps to enhance the inter-class distinction and shrink the intra-class variation simultaneously.

Table 2. Clustering performance on four datasets compared with state-of-the-art unsupervised methods. (Note: MPMNMF_1 and MPMNMF_2 adopt the Euclidean distance based and the kernel based pair-wise co-regularization, respectively.)

Dataset  Metric  VAGNMF      VCGNMF      LP-DiNMF    rNNMF       MPMNMF_1    MPMNMF_2    UDNMF       DCS2MVNMF
Yale     AC      48.55±2.95  53.52±3.14  51.82±3.03  51.15±5.49  50.00±3.05  49.52±3.67  47.76±2.83  61.03±3.57
Yale     NMI     52.38±2.70  56.79±2.27  53.73±2.67  53.00±4.40  52.89±2.78  54.00±2.57  49.88±2.73  63.35±2.18
ORL      AC      68.78±3.17  70.28±2.26  68.48±3.33  64.35±2.74  70.25±2.87  69.48±2.89  61.73±1.56  76.86±1.46
ORL      NMI     82.88±1.68  83.67±1.01  82.78±1.80  79.79±1.31  83.65±0.99  83.64±1.31  78.54±1.03  87.09±0.58
ECG      AC      55.41±4.58  56.43±2.93  54.90±5.69  57.38±0.45  57.86±3.60  58.20±3.71  56.84±3.11  70.63±6.19
ECG      NMI     22.82±4.34  26.36±4.43  22.52±7.80  23.80±1.24  19.72±4.60  20.41±4.42  22.28±3.25  40.52±6.79
WebKB    AC      77.02±1.24  79.57±0.70  76.49±0.79  78.30±1.17  81.11±1.77  78.98±2.10  79.34±4.80  92.09±1.46
WebKB    NMI     26.84±2.28  29.42±1.42  26.14±1.56  24.63±5.85  34.22±2.46  30.99±5.15  29.98±9.00  63.64±5.18

Table 3. Clustering performance on four datasets compared with state-of-the-art semi-supervised methods.

Dataset  Ratio  AMVNMF (AC / NMI)        MVCNMF (AC / NMI)        MVOCNMF (AC / NMI)       DCS2MVNMF (AC / NMI)
Yale     10%    50.40±0.89 / 53.51±0.75  50.99±1.53 / 54.54±1.59  51.33±1.62 / 54.61±1.10  61.03±3.57 / 63.35±2.18
Yale     20%    55.38±1.97 / 59.08±2.04  57.68±1.12 / 61.28±1.24  57.84±1.44 / 61.34±0.90  70.88±3.31 / 69.02±2.47
Yale     30%    61.56±1.70 / 65.45±1.68  62.63±1.56 / 67.00±1.02  63.70±2.63 / 67.57±1.40  77.50±2.06 / 74.92±2.43
ORL      10%    63.15±0.65 / 79.16±0.51  66.48±0.80 / 81.41±0.40  66.77±1.29 / 81.22±0.56  76.86±1.46 / 87.09±0.58
ORL      20%    68.41±0.54 / 82.37±0.29  70.18±0.85 / 83.76±0.28  70.27±0.55 / 83.81±0.40  85.69±3.02 / 91.58±1.38
ORL      30%    71.99±1.20 / 84.51±0.42  74.02±1.41 / 85.98±0.50  75.27±0.54 / 86.55±0.13  89.13±1.47 / 92.68±1.03
ECG      10%    50.90±0.96 / 13.90±1.19  53.78±0.56 / 26.23±0.87  58.69±1.15 / 26.31±1.39  70.63±6.19 / 40.52±6.79
ECG      20%    54.81±2.34 / 18.93±1.51  57.50±1.57 / 28.86±0.89  63.74±1.84 / 25.51±1.58  82.26±3.25 / 56.08±6.23
ECG      30%    55.67±3.63 / 19.20±3.54  66.22±1.49 / 31.46±1.24  70.29±2.65 / 36.43±2.16  84.97±1.24 / 58.45±2.62
WebKB    10%    82.44±1.54 / 33.26±3.55  82.73±2.39 / 38.54±4.39  83.29±1.96 / 39.89±3.84  92.09±1.46 / 63.64±5.18
WebKB    20%    85.90±1.94 / 41.58±5.12  84.44±2.05 / 43.13±3.01  81.87±2.61 / 39.35±3.54  92.35±2.58 / 62.12±8.44
WebKB    30%    90.17±1.00 / 53.84±3.19  91.06±0.73 / 59.09±1.66  90.45±0.68 / 57.14±1.58  95.03±1.68 / 71.92±7.23

4.4. Parameter Analysis

We also conduct a parameter analysis on the ORL and WebKB datasets. Figure 2 shows the performance of the semi-supervised DCS2MVNMF on ORL and WebKB with different ratios of labeled data points (10%, 20% and 30%, respectively) versus the parameters α, β and γ. We can see that when α < 10^2 and β > 10^0 with γ ≤ 10^1 or γ ≤ 10^0, the performance of the proposed method is better.

[Fig. 2. DCS2MVNMF with different parameters α, β and γ on ORL (first row) and WebKB (second row) datasets.]

4.5. Ablation Analysis

Table 4 shows the ablation results of our method on the Yale and ORL datasets. In the ablation study, the reconstruction term together with the consensus alignment term is chosen as the "Baseline". The table shows that the discriminative prior ("Baseline+α") and the geometrical prior ("Baseline+β") effectively boost the clustering performance, with considerable improvements on both datasets. In addition, it also suggests that the proposed feature normalization is beneficial to multiple-feature alignment and simplifies the optimization.

Table 4. Ablation study on Yale and ORL datasets with 10% labeled data points.

                    Yale (AC / NMI)  ORL (AC / NMI)
Baseline            46.16 / 50.40    60.13 / 76.70
Baseline+α          56.76 / 59.02    71.14 / 82.61
Baseline+β          50.10 / 53.56    67.70 / 83.19
DCS2MVNMF w/o Q^v   59.25 / 61.75    70.31 / 83.79
DCS2MVNMF           61.03 / 63.35    76.86 / 87.09

5. CONCLUSION

In this paper, a novel semi-supervised MVNMF with discriminative and geometrical priors is proposed, which enlarges the inter-class diversities and shrinks the intra-class variation simultaneously. To effectively align the features of multiple views, a new feature normalizing scheme is adopted to restrict the feature scales of the different views. Experiments on four datasets have verified the effectiveness of our method.

6. REFERENCES

[1] Peng Luo, Jinye Peng, Ziyu Guan, and Jianping Fan, "Dual regularized multi-view non-negative matrix factorization for clustering," Neurocomputing, vol. 294, pp. 1–11, 2018.

[2] Zheng Zhang, Li Liu, Fumin Shen, Heng Tao Shen, and Ling Shao, "Binary multi-view clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1774–1782, 2018.

[3] Jing Wang, Feng Tian, Hongchuan Yu, Chang Hong Liu, Kun Zhan, and Xiao Wang, "Diverse non-negative matrix factorization for multiview data representation," IEEE Transactions on Cybernetics, vol. 48, no. 9, pp. 2620–2632, 2017.

[4] Handong Zhao and Zhengming Ding, "Multi-view clustering via deep matrix factorization," in The Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[5] Jianqiang Li, Guoxu Zhou, Yuning Qiu, Yanjiao Wang, Yu Zhang, and Shengli Xie, "Deep graph regularized non-negative matrix factorization for multi-view clustering," Neurocomputing, vol. 390, pp. 108–116, 2020.

[6] Jialu Liu, Chi Wang, Jing Gao, and Jiawei Han, "Multi-view clustering via joint nonnegative matrix factorization," in Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, 2013, pp. 252–260.

[7] Xiumei Wang, Tianzhen Zhang, and Xinbo Gao, "Multiview clustering based on non-negative matrix factorization and pairwise measurements," IEEE Transactions on Cybernetics, vol. 49, no. 9, pp. 3333–3346, 2018.

[8] Zuyuan Yang, Naiyao Liang, Wei Yan, Zhenni Li, and Shengli Xie, "Uniform distribution non-negative matrix factorization for multiview clustering," IEEE Transactions on Cybernetics, 2020.

[9] Zuyuan Yang, Yong Xiang, Kan Xie, and Yue Lai, "Adaptive method for nonsmooth nonnegative matrix factorization," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 4, pp. 948–960, 2016.

[10] Jing Liu, Yu Jiang, Zechao Li, Zhi-Hua Zhou, and Hanqing Lu, "Partially shared latent factor learning with multiview data," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 6, pp. 1233–1246, 2014.

[11] Hao Cai, Bo Liu, Yanshan Xiao, and LuYue Lin, "Semi-supervised multi-view clustering based on orthonormality-constrained nonnegative matrix factorization," Information Sciences, 2020.

[12] Naiyao Liang, Zuyuan Yang, Zhenni Li, Shengli Xie, and Chun-Yi Su, "Semi-supervised multi-view clustering with graph-regularized partially shared non-negative matrix factorization," Knowledge-Based Systems, vol. 190, pp. 105185, 2020.

[13] Haifeng Liu, Zhaohui Wu, Xuelong Li, Deng Cai, and Thomas S. Huang, "Constrained nonnegative matrix factorization for image representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 1299–1311, Jul. 2012.

[14] Jing Wang, Xiao Wang, Feng Tian, Chang Hong Liu, Hongchuan Yu, and Yanbei Liu, "Adaptive multi-view semi-supervised nonnegative matrix factorization," in International Conference on Neural Information Processing. Springer, 2016, pp. 435–444.

[15] Hao Cai, Bo Liu, Yanshan Xiao, and LuYue Lin, "Semi-supervised multi-view clustering based on constrained nonnegative matrix factorization," Knowledge-Based Systems, vol. 182, pp. 104798, 2019.

[16] Jianchao Yang, Shuicheng Yan, Yun Fu, Xuelong Li, and Thomas Huang, "Non-negative graph embedding," in IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, Alaska, Jun. 2008, pp. 1–8.

[17] Deng Cai, Xiaofei He, Jiawei Han, and Hong-Jiang Zhang, "Orthogonal laplacianfaces for face recognition," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3608–3614, 2006.

[18] Wenzhang Zhuge, Chenping Hou, Yuanyuan Jiao, Jia Yue, Hong Tao, and Dongyun Yi, "Robust auto-weighted multi-view subspace clustering with common subspace representation matrix," PLoS ONE, vol. 12, no. 5, pp. e0176769, 2017.

[19] Feiqiong Chen, Guopeng Li, Shuaihui Wang, and Zhisong Pan, "Multiview clustering via robust neighboring constraint nonnegative matrix factorization," Mathematical Problems in Engineering, vol. 2019, 2019.

[20] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548–1560, Aug. 2011.