
LETTER Deep Clustering for Improved Inter-Cluster Separability and Intra-Cluster Homogeneity with Cohesive Loss

Byeonghak KIM†, Murray LOEW††a), David K. HAN†††b), Nonmembers, and Hanseok KO††††c), Member

SUMMARY   To date, many studies have employed clustering for the classification of unlabeled data. Deep clustering applies deep learning models to conventional clustering algorithms to more clearly separate the distributions of the clusters. In this paper, we employ a convolutional autoencoder to learn the features of input images. Following this, k-means clustering is conducted using the encoded-layer features learned by the convolutional autoencoder. A center loss function is then added to aggregate the data points into clusters and increase the intra-cluster homogeneity. Finally, we calculate and increase the inter-cluster separability. We combine all loss functions into a single global objective function. Our new deep clustering method surpasses the performance of existing clustering approaches when compared in experiments under the same conditions.
key words: separate clustering, convolutional autoencoder, intra-cluster homogeneity, inter-cluster separability

Manuscript received October 26, 2020.
Manuscript revised December 31, 2020.
Manuscript publicized January 28, 2021.
† The author is with the Dept. of Visual Information Processing, Korea University, Seoul, 02841, Korea.
†† The author is with the Dept. of Biomedical Engineering, George Washington University, Washington DC, USA.
††† The author is with the Dept. of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA.
†††† The author is with the School of Electrical Engineering, Korea University, Seoul, 02841, Korea.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected] (Corresponding author)
DOI: 10.1587/transinf.2020EDL8138

1. Introduction

In machine learning, data classification is a particularly important task. However, individually labeling data points requires significant time and effort, and it is often impossible to fully label datasets for research applications. To overcome this problem, clustering via unsupervised learning has been proposed and is widely utilized. Clustering effectively groups unlabeled data based on specific criteria of similarity and can automatically extract semantic information that humans cannot abstract.

Many data-mining researchers have investigated various types of clustering. Both hard and soft clustering are possible, depending on whether an observation point belongs to one or to multiple clusters. Hard clustering includes k-means clustering [1], k-medoid clustering [2], density-based spatial clustering of applications with noise (DBSCAN) [3], hierarchical clustering [4], and random binary pattern of patch clustering (RBPPC) [5]. Soft clustering includes Gaussian mixture model-based clustering [6] and fuzzy clustering [7]. Recently, deep clustering methods that extract features to be used for data representation have been proposed. An autoencoder is an unsupervised feature extraction tool used in deep clustering. Representative autoencoder-based deep clustering algorithms include autoencoder-based data clustering (ABDC) [8], deep embedded clustering (DEC) [9], improved deep embedded clustering (IDEC) [10], discriminatively boosted clustering (DBC) [11], and deep embedded regularized clustering (DEPICT) [12]. They are all based on the idea that the neural network learns features that are suitable for clustering. In addition, deep embedded clustering with data augmentation (DEC-DA) [13] applies random rotation, cropping, shearing, and shifting to the data to generalize the model. Yet, these techniques exhibit neither strong inter-cluster separability nor robust intra-cluster homogeneity, leaving room for improvement.

In this paper, we propose a new clustering algorithm that separates clusters more effectively than existing deep clustering algorithms by increasing the inter-cluster separability and the intra-cluster homogeneity. Our deep separate clustering algorithm makes scattered samples characterized by a Gaussian distribution more cohesive when data points are very similar and, when the similarity is low, separates the data points into their corresponding clusters. This process is also effective when processing data with overlapping clusters. Our proposed method greatly improves on the performance of existing deep clustering algorithms when tested on public datasets.

In Sect. 2, the proposed method is described. The process and results of the experiments are presented in Sect. 3. Finally, Sect. 4 provides the conclusions.

2. Proposed Deep Clustering with Cohesive Loss

In this section, we propose a deep clustering algorithm based on the separation between clusters, using four loss functions for global optimization. First, deep features are learned while reconstructing unlabeled data with a convolutional autoencoder (CAE). Upon completion of the pre-training of the autoencoder, the second stage focuses on clustering the deep features in an end-to-end manner, using the distances between features assigned to the same cluster and the distances between data points assigned to different clusters.


Fig. 1   The structure of the feature extraction process with a convolutional autoencoder for the Fashion-MNIST dataset. The size of the embedded features is smaller than that of the input X. The learned features can be used for clustering.

2.1 Feature Extraction with a Convolutional Autoencoder

An autoencoder is a deep learning approach that learns the features of unlabeled data and is widely employed as a feature extractor. The structure is shown in Fig. 1.

In the training set $X = \{x_i \in \mathbb{R}^D\}_{i=1}^{m}$, $x_i$ denotes the $i$-th training data point from $m$ data points, and $D$ is the dimension of $x_i$. The autoencoder loss function is

$L_{ae} = \| X - \hat{X} \|^{2}$   (1)

In addition, because the autoencoder learns features in order to reconstruct the input $X$ as accurately as possible through the encoder and decoder, producing the output $\hat{X}$, the autoencoder loss is also referred to as the reconstruction loss. Equation (1) can thus be expressed as follows:

$L_r = \frac{1}{m} \sum_{i=1}^{m} \| x_i - \hat{x}_i \|_2^2$   (2)

The embedded features learned by Eq. (2) become the input for the subsequent clustering algorithm described in Sects. 2.2 and 2.3.
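For concreteness, the following Keras sketch shows one way such a CAE can be built for 28 × 28 inputs. The filter counts follow the description in Sect. 3.2 (32 and 64 filters of size 5 × 5 with stride 2, then 128 filters of size 3 × 3); the embedding size d = 10, the padding choices, and the optimizer are our assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of a convolutional autoencoder for 28x28 grayscale inputs.
# Layer widths follow Sect. 3.2 of the paper; everything else is an assumption.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cae(input_shape=(28, 28, 1), d=10):
    x_in = layers.Input(shape=input_shape)
    # Encoder: three convolutional layers, then a dense embedding of size d.
    h = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(x_in)   # 14x14x32
    h = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(h)      # 7x7x64
    h = layers.Conv2D(128, 3, strides=2, padding="valid", activation="relu")(h)    # 3x3x128
    z = layers.Dense(d, name="embedding")(layers.Flatten()(h))
    # Decoder: mirror of the encoder, reconstructing the input.
    g = layers.Dense(3 * 3 * 128, activation="relu")(z)
    g = layers.Reshape((3, 3, 128))(g)
    g = layers.Conv2DTranspose(64, 3, strides=2, padding="valid", activation="relu")(g)  # 7x7x64
    g = layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu")(g)   # 14x14x32
    x_rec = layers.Conv2DTranspose(1, 5, strides=2, padding="same", name="reconstruction")(g)
    return models.Model(x_in, [x_rec, z], name="cae")

# Pre-training minimizes the reconstruction loss L_r of Eq. (2), e.g.:
# cae = build_cae()
# cae.compile(optimizer="adam", loss={"reconstruction": "mse"})
# cae.fit(x_train, {"reconstruction": x_train}, batch_size=256, epochs=200)
```

For USPS (16 × 16 inputs), the spatial dimensions of the layers would have to be adapted accordingly.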

2.2 Loss Function for Deep Embedded Clustering

Autoencoder-based clustering is also known as deep embedded clustering. The main idea of deep embedded clustering algorithms is to perform clustering on the features obtained from the autoencoder. As mentioned earlier, DEC-DA [13], among others, also improves model generalization using data augmentation techniques. For this reason, we employ data augmentation as well.

The features in deep embedded clustering generally have smaller dimensions than the input and output data. In this paper, we perform clustering on the embedded features $Z = \{z_i \in \mathbb{R}^d\}_{i=1}^{m}$, where $d$ is the embedded feature size and $m$ is the number of data points. The clustering loss function is calculated by applying the Kullback–Leibler divergence (KLD) [15] to the Student's t-distribution [14] results obtained by performing soft deep clustering with the network described in Sect. 2.1.

In more detail, k-means clustering is performed on $Z$ to obtain $M = \{\mu_j \in \mathbb{R}^d\}_{j=1}^{n}$, where $\mu_j$ is the $j$-th of the $n$ centroids. The output distribution $q_{ij}$ resulting from clustering with $Z$ and $M$ can be defined as follows based on the Student's t-distribution:

$q_{ij} = \dfrac{\left(1 + \| z_i - \mu_j \|^2\right)^{-1}}{\sum_{j'} \left(1 + \| z_i - \mu_{j'} \|^2\right)^{-1}}$   (3)

The target distribution $p_{ij}$ can be expressed based on $q_{ij}$:

$p_{ij} = \dfrac{Q_{ij}^{\,T} / \sum_i q_{ij}}{\sum_{j'} \left( Q_{ij'}^{\,T} / \sum_i q_{ij'} \right)}$   (4)

where $Q_{ij} = (e^{q_{ij}} - 1)/(e - 1)$ and $T > 0$. We set this $T$ value to 3. The range of $Q_{ij}$ is 0 to 1 because $q_{ij}$ takes values from 0 to 1. In Eq. (4), the exponential function constructs $p$ by adding nonlinearity to $q$. Therefore, the target distribution $p$ enhances the prediction by giving more emphasis to the cluster assignments with high probability in $q$. In addition, the loss is regularized by the soft cluster frequency $\sum_i q_{ij}$ to prevent distortion of the entire feature space caused by clusters of different densities contributing unequally to the loss. Finally, the clustering loss function is

$L_c = D_{KL}(P \parallel Q) = \sum_i \sum_j p_{ij} \log \dfrac{p_{ij}}{q_{ij}}$   (5)
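As an illustration of Eqs. (3)–(5), the following NumPy sketch computes the soft assignments, the sharpened target distribution, and the KL clustering loss. The order of the normalization steps in Eq. (4) follows our reading of the reconstructed formula, so treat this as a sketch rather than the authors' implementation.

```python
import numpy as np

def soft_assignment(Z, M):
    # Eq. (3): Student's t similarity between embedded points z_i and centroids mu_j.
    d2 = ((Z[:, None, :] - M[None, :, :]) ** 2).sum(axis=-1)   # squared Euclidean distances
    q = 1.0 / (1.0 + d2)
    return q / q.sum(axis=1, keepdims=True)                    # normalize over clusters

def target_distribution(q, T=3.0):
    # Eq. (4): sharpen the assignments with the exponential rescaling Q and the power T,
    # and divide by the soft cluster frequency sum_i q_ij to balance cluster sizes.
    Q = (np.exp(q) - 1.0) / (np.e - 1.0)      # maps q in [0, 1] to Q in [0, 1]
    w = Q ** T / q.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)   # normalize over clusters

def clustering_loss(p, q, eps=1e-12):
    # Eq. (5): KL divergence D_KL(P || Q).
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Usage would look like `q = soft_assignment(Z, M)`, `p = target_distribution(q)`, and `loss = clustering_loss(p, q)`, with `Z` the embedded features and `M` the current centroids.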
2.3 The Proposed Discriminative Cluster Loss Function

The final goal of clustering is to assign data points with a strong similarity to the same cluster under a certain similarity measurement. Therefore, to achieve robust clustering, both the intra-cluster homogeneity and the inter-cluster separability should be increased. In this study, $L_W$ is used to minimize the intra-cluster distance, and $L_B$ reduces the inter-cluster proximity. $L_W$ is a function that represents the variation between the cluster centroids and the inner cluster points, as in [16]. $L_W$ increases the homogeneity of the inner cluster points by making the data points in the same cluster gather around their centroid. $L_W$ is represented by (6):

$L_W = \dfrac{1}{2} \sum_{i=1}^{m} \| z_i - \mu_{y_i} \|_2^2$   (6)

where $y_i$ is the predicted cluster label for the $i$-th sample and $\mu_{y_i} \in \mathbb{R}^d$ is the centroid of the $y_i$-th cluster. $\mu_{y_i}$ can be updated during the iterative training process. $L_B$ is a function calculated from the inter-cluster cosine distance; thus, a smaller $L_B$ means larger inter-cluster distances. $L_B$ can be derived as follows:

$L_B = \dfrac{1}{2} \cdot \dfrac{1}{{}_{n}C_{2}} \sum_{j=0}^{n-1} \sum_{k=0}^{n-1} \log \left( \mathrm{ReLU} \left( \dfrac{\mu_j \cdot \mu_k}{\| \mu_j \|_2 \, \| \mu_k \|_2} \right) + 1 + \epsilon \right)$   (7)

where $n$ denotes the number of clusters, and $\mu_j$ and $\mu_k$ are the $j$-th and $k$-th centroids of the embedded features from the CAE, respectively. ${}_{n}C_{2}$ represents the number of all combinations when pairing two of the $n$ clusters. The ReLU activation function is used to prevent a negative value of the cosine similarity between the centroids of each cluster pair; the constants 1 and $\epsilon$ prevent the output of the log function from becoming negative or zero, respectively. Because $L_B$ is significantly larger than our other loss functions, it is log-transformed for normalization.
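The two cohesive losses can likewise be sketched in NumPy, continuing the snippet above. The small constant `eps` stands for the $\epsilon$ term in Eq. (7), and summing over all centroid pairs, including $j = k$, follows our reading of the reconstructed formula.

```python
import numpy as np

def center_loss(Z, M, y):
    # Eq. (6): L_W pulls each embedded point z_i toward the centroid of its cluster y_i.
    return 0.5 * float(np.sum((Z - M[y]) ** 2))

def separation_loss(M, eps=1e-6):
    # Eq. (7): L_B penalizes centroid pairs with positive cosine similarity, i.e. clusters
    # pointing in similar directions; minimizing it pushes the centroids apart.
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit-norm centroids
    cos = Mn @ Mn.T                                     # pairwise cosine similarities
    n = M.shape[0]
    n_pairs = n * (n - 1) / 2.0                         # nC2
    return 0.5 / n_pairs * float(np.sum(np.log(np.maximum(cos, 0.0) + 1.0 + eps)))
```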

Finally, the global loss function in Eq. (8) is created by combining the loss functions in Eqs. (2), (5), (6), and (7):

$L = \alpha \cdot L_r + \beta \cdot L_c + \gamma \cdot L_W + \omega \cdot L_B$   (8)

where $\alpha$, $\beta$, $\gamma$, and $\omega$ are heuristically determined loss weight parameters. Competition among these losses may cause oscillations in the learning process; however, combining them into a single global loss and carefully selecting the weight parameters ensure stable and effective learning.

The process of the proposed method is summarized in Algorithm 1. First, the initial $z_i$ is obtained by pre-training the CAE on the public database. Second, k-means clustering is performed to obtain the initial $\mu_j$. Then, $q_{ij}$ and $p_{ij}$ are initialized by Eqs. (3) and (4). After that, the network is updated repeatedly until the appropriate stopping criterion is satisfied. During this stage, the model's global loss function is optimized by a stochastic gradient descent algorithm.
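A possible joint update step combining Eqs. (2)–(8), with the MNIST weights from Sect. 3.2 as defaults, is sketched below in TensorFlow. It assumes `cae` is a two-output Keras model like the sketch in Sect. 2.1, `centroids` is a trainable variable of shape (n, d), and `p` is the target distribution recomputed periodically via Eq. (4); this illustrates the global objective, not the authors' released code.

```python
import tensorflow as tf

def train_step(x, p, cae, centroids, opt,
               alpha=1.0, beta=0.8, gamma=1.0, omega=1.0, eps=1e-6):
    with tf.GradientTape() as tape:
        x_rec, z = cae(x, training=True)
        # L_r (Eq. (2)): mean squared reconstruction error per sample.
        l_r = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_rec), axis=[1, 2, 3]))
        # q_ij (Eq. (3)): Student's t soft assignments of embeddings to centroids.
        d2 = tf.reduce_sum(tf.square(z[:, None, :] - centroids[None, :, :]), axis=-1)
        q = 1.0 / (1.0 + d2)
        q = q / tf.reduce_sum(q, axis=1, keepdims=True)
        # L_c (Eq. (5)): KL divergence to the precomputed target distribution p.
        l_c = tf.reduce_sum(p * tf.math.log((p + 1e-12) / (q + 1e-12)))
        # L_W (Eq. (6)): center loss w.r.t. the currently assigned centroid.
        y = tf.argmax(q, axis=1)
        l_w = 0.5 * tf.reduce_sum(tf.square(z - tf.gather(centroids, y)))
        # L_B (Eq. (7)): inter-cluster cosine separation loss over centroid pairs.
        mu = tf.math.l2_normalize(centroids, axis=1)
        cos = tf.matmul(mu, mu, transpose_b=True)
        n = tf.cast(tf.shape(centroids)[0], tf.float32)
        l_b = 0.5 / (n * (n - 1.0) / 2.0) * tf.reduce_sum(
            tf.math.log(tf.nn.relu(cos) + 1.0 + eps))
        # Global objective (Eq. (8)).
        loss = alpha * l_r + beta * l_c + gamma * l_w + omega * l_b
    variables = cae.trainable_variables + [centroids]
    opt.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```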

3. Experiments

To verify our proposed method, we compare the performance of many prominent clustering algorithms on four image datasets. As a result of the experiments, our proposed method exhibits superior performance in terms of clustering accuracy (ACC) and normalized mutual information (NMI) when compared to state-of-the-art clustering algorithms.

Fig. 2   If the inter-cluster distance increases, the distribution of the data points changes and the degree of separation of the clusters increases. The left side shows the case with small inter-cluster separability, while the right side shows a large separability between the clusters (i.e., the distance is suitably large). (a), (c), (e), and (g) show distributions prior to clustering, while (b), (d), (f), and (h) show distributions after clustering for the MNIST-full, MNIST-test, USPS, and Fashion-MNIST datasets, respectively.

3.1 Databases

• MNIST: A total of 70,000 handwritten digit images (60,000 training images and 10,000 test images), 10 classes, 28 × 28 grayscale.
• MNIST-test: A total of 10,000 handwritten digit images, 10 classes, 28 × 28 grayscale.
• USPS: A total of 9,298 handwritten digit images (7,291 training images and 2,007 test images), 16 × 16 grayscale.
• Fashion-MNIST: A total of 70,000 fashion item images (60,000 training images and 10,000 test images), 10 classes, 28 × 28 grayscale.

We normalized all of these images to pixel values from 0 to 1.

3.2 Implementation and Results of the Experiments

As shown in Fig. 2, after performing the proposed clustering on the sample points in (a), (c), (e), and (g), the inter-cluster separability and intra-cluster homogeneity are further improved in (b), (d), (f), and (h). In particular, the proposed method is effective at separating adjacent clusters, as in (a), (c), (e), and (g), where the clusters overlap.
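For reference, ACC is conventionally computed by finding the best one-to-one mapping between predicted cluster labels and ground-truth classes with the Hungarian algorithm, and NMI is available directly in scikit-learn. The paper does not list its evaluation code, so the snippet below is a standard sketch of these two metrics rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # Build the contingency matrix between predicted clusters and true classes,
    # then find the best one-to-one mapping with the Hungarian algorithm.
    k = int(max(y_pred.max(), y_true.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    row, col = linear_sum_assignment(-cost)          # maximize matched counts
    return cost[row, col].sum() / len(y_true)

# Example: acc = clustering_accuracy(y_true, y_pred)
#          nmi = normalized_mutual_info_score(y_true, y_pred)
```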

Table 1 Comparison of experimental results with other prominent clustering methods.

In our experiments, the reconstruction, center, separation, and clustering loss functions are all used to improve clustering performance. Each loss function has its own weighting parameter, and these were determined heuristically:
MNIST: α = 1.0, β = 0.8, γ = 1.0, ω = 1.0.
MNIST-test: α = 20.0, β = 1.0, γ = 0.5, ω = 0.5.
USPS: α = 1.0, β = 0.5, γ = 0.1, ω = 0.1.
Fashion-MNIST: α = 1.0, β = 0.01, γ = 0.1, ω = 0.1.

The CAE used to extract the learned features is divided into an encoder and a decoder section, each consisting of three layers. The first and second layers of the encoder have 32 and 64 filters, respectively, with a kernel size of 5 × 5 and a stride of 2. The third layer has 128 filters and a kernel size of 3 × 3. The decoder section has the same structure as the encoder layers in reverse order.

We compare the results of our experiments with 15 existing state-of-the-art clustering algorithms to verify the performance of our proposed method (Table 1). The numbers in red and blue represent the best and second-best performance, respectively. As shown in the table, our proposed algorithm outperforms the other algorithms for all datasets and both metrics (ACC and NMI). Compared to DEC, a prominent deep clustering algorithm, the proposed method achieved a 15.8% improvement in ACC and a 19.9% gain in NMI on the Fashion-MNIST dataset.

4. Conclusions

In this paper, we proposed a novel approach for deep clustering. Experimental results have shown that our proposed algorithm can effectively cluster unlabeled datasets by applying cohesive loss functions. The proposed clustering method includes reconstruction, intra-cluster homogeneity, inter-cluster separability, and KL-divergence loss functions, and it is shown to exceed the performance of existing clustering algorithms. In addition, our proposed method can be customized according to the task by adjusting the loss weights.

Acknowledgments

This research was supported by the Government-wide R&D Fund project for infectious disease research (GFID), Republic of Korea (grant number: HG19C0682). The work of Murray Loew was supported by GWU (KU-GWU Joint Research Fund).

References

[1] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp.281–297, 1967.
[2] L. Kaufman and P.J. Rousseeuw, "Clustering by means of medoids," Faculty of Mathematics and Informatics, 1987.
[3] R.F. Ling, "On the theory and construction of k-clusters," The Computer Journal, vol.15, no.4, pp.326–332, 1972.
[4] L. Rokach and O. Maimon, "Clustering methods," Data Mining and Knowledge Discovery Handbook, Springer, Boston, MA, pp.321–352, 2005.
[5] H. Wang, Z. Xu, and H. Ko, "Random binary local patch clustering transforms based image matching for nonlinear intensity changes," Mathematical Problems in Engineering, 2018.
[6] C. Améndola, J.-C. Faugère, and B. Sturmfels, "Moment varieties of Gaussian mixtures," arXiv preprint arXiv:1510.04654, 2015.
[7] J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, pp.32–57, 1973.
[8] C. Song, F. Liu, Y. Huang, L. Wang, and T. Tan, "Auto-encoder based data clustering," Iberoamerican Congress on Pattern Recognition, Springer, Berlin, Heidelberg, pp.117–124, 2013.
[9] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," International Conference on Machine Learning, pp.478–487, 2016.
[10] X. Guo, L. Gao, X. Liu, and J. Yin, "Improved deep embedded clustering with local structure preservation," IJCAI, pp.1753–1759, 2017.
[11] F. Li, H. Qiao, and B. Zhang, "Discriminatively boosted image clustering with fully convolutional auto-encoders," Pattern Recognition, vol.83, pp.161–173, 2018.
[12] K.G. Dizaji, A. Herandi, C. Deng, W. Cai, and H. Huang, "Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization," Proc. IEEE International Conference on Computer Vision, pp.5736–5745, 2017.
[13] X. Guo, et al., "Deep embedded clustering with data augmentation," Asian Conference on Machine Learning, pp.550–565, 2018.
[14] Student, "The probable error of a mean," Biometrika, vol.6, no.1, pp.1–25, 1908.
[15] S. Kullback and R.A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol.22, no.1, pp.79–86, 1951.
[16] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," European Conference on Computer Vision, Springer, Cham, pp.499–515, 2016.
[17] X. Chen and D. Cai, "Large scale spectral clustering with landmark-based representation," Twenty-Fifth AAAI Conference on Artificial Intelligence, pp.313–318, 2011.
[18] D. Cai, et al., "Locality preserving nonnegative matrix factorization," IJCAI, pp.1010–1015, 2009.
[19] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," Proc. 25th International Conference on Machine Learning, pp.1096–1103, 2008.
[20] W. Zhang, X. Wang, D. Zhao, and X. Tang, "Graph degree linkage: Agglomerative clustering on a directed graph," European Conference on Computer Vision, Springer, Berlin, Heidelberg, pp.428–441, 2012.
[21] B. Yang, et al., "Towards k-means-friendly spaces: Simultaneous deep learning and clustering," International Conference on Machine Learning, pp.3861–3870, 2017.
[22] Z. Jiang, et al., "Variational deep embedding: An unsupervised and generative approach to clustering," arXiv preprint arXiv:1611.05148, 2016.