A Hybrid Autoencoder Network for Unsupervised Image Clustering

algorithms Article A Hybrid Autoencoder Network for Unsupervised Image Clustering Pei-Yin Chen and Jih-Jeng Huang * Department of Computer Science & Information Management, SooChow University, No.56 Kueiyang Street, Section 1, Taipei 100, Taiwan; [email protected] * Correspondence: [email protected] Received: 29 April 2019; Accepted: 13 June 2019; Published: 15 June 2019 Abstract: Image clustering involves the process of mapping an archive image into a cluster such that the set of clusters has the same information. It is an important field of machine learning and computer vision. While traditional clustering methods, such as k-means or the agglomerative clustering method, have been widely used for the task of clustering, it is difficult for them to handle image data due to having no predefined distance metrics and high dimensionality. Recently, deep unsupervised feature learning methods, such as the autoencoder (AE), have been employed for image clustering with great success. However, each model has its specialty and advantages for image clustering. Hence, we combine three AE-based models—the convolutional autoencoder (CAE), adversarial autoencoder (AAE), and stacked autoencoder (SAE)—to form a hybrid autoencoder (BAE) model for image clustering. The MNIST and CIFAR-10 datasets are used to test the result of the proposed models and compare the results with others. The results of the clustering criteria indicate that the proposed models outperform others in the numerical experiment. Keywords: image clustering; convolutional autoencoder (CAE); adversarial autoencoder (AAE); stacked autoencoder (SAE) 1. Introduction Data mining (DM) is the one of the key processes of knowledge discovery in databases (KDD) to transform raw data into interesting information and knowledge [1] and is the main research topic in the field of machine learning and artificial intelligence. The classification of data mining depends on the tasks of the problem; one of the tasks is clustering, which groups similar data into a segment. Here, we focus on image clustering, which is an essential issue in machine learning and computer vision. Hence, the purpose of image clustering is to group images into clusters such that an image is similar to others within a cluster. Many statistical methods, e.g., k-means or DBSCAN, have been used in image clustering. However, these methods have difficulty in handling image data, since images are usually high-dimensional, resulting in the poor performance of these traditional methods [2]. Recently, with the development of deep learning, more neural network models have been developed for image clustering. The most famous one is the autoencoder (AE) network, which firstly pre-trains deep neural networks with unsupervised methods and employs traditional methods, e.g., k-means, for clustering images in post-processing. More recently, several autoencoder-based networks have been proposed—e.g., the convolutional autoencoder (CAE) [3], adversarial autoencoder (AAE) [4], stacked autoencoder (SAE) [5], variational autoencoder (VAE) [6,7], etc.—and these models have been reported to achieve great success in the fields of supervised and unsupervised learning [8–10]. However, as we know of no technique or model which outperforms others in all situations, every model has its specialty to specific tasks or functions. Therefore, although numerous autoencoder-based Algorithms 2019, 12, 122; doi:10.3390/a12060122 www.mdpi.com/journal/algorithms Algorithms 2019, 12, 122 2 of 11 models have been proposed to learn feature representations from images, the best one depends on the situation. Hence, in this paper, we consider a hybrid model, namely the hybrid autoencoder (BAE), to integrate the advantages of three autoencoders, CAE, AAE, and SAE, to learn low and high-level feature representation. Then, we use the k-means to cluster images. The concept of hybrid modelsAlgorithms is not a 2019 new, 12, idea.x FOR PEER For REVIEW example, [11] integrated AE and density estimation models2 of 11 to take advantage of their different strengths for anomaly detection. [12] proposed a hybrid spatial–temporal best one depends on the situation. Hence, in this paper, we consider a hybrid model, namely the autoencoderhybrid toautoencoder detect abnormal (BAE), to eventsintegrate in the videos advantages by the of three integration autoencoders, of a CAE, long AAE, short-term and SAE, memory (LSTM)to encoder–decoder learn low and high- andlevel thefeature convolutional representation. autoencoder. Then, we use the These k-means results to cluster of the images. hybrid The models outperformconcept other of hybrid state-of-the-art models is not models. a new idea. For example, [11] integrated AE and density estimation Themodels differences to take of advantage the proposed of their method different from strengths others for are anomaly that first detection. we focus [12] onproposed the task a hybrid of clustering, and thenspatial we integrate–temporal di autoencoderfferent AE-based to detect methods, abnormal which events were in videos not considered by the integration previously. of a Inlong addition, short-term memory (LSTM) encoder–decoder and the convolutional autoencoder. These results of in our experiment, we use two datasets, MNIST and CIFAR-10, which are famous image datasets, the hybrid models outperform other state-of-the-art models. to compareT thehe clusteringdifferences of performance the proposed withmethod others. from Theothers experiment are that first results we focus indicate on the thetask clustering of performanceclustering of,the and proposed then we methodintegrate isdifferent better AE than-based that methods of others, which with were respect not toconsider the resultsed of unsupervisedpreviously. clustering In addition, accuracy in our (ACC),experiment, normalized we use two mutual datasets information, MNIST and (NMI),CIFAR-10, and which adjusted are rand index (ARI).famous image datasets, to compare the clustering performance with others. The experiment results indicate the clustering performance of the proposed method is better than that of others with respect 2. Autoencoder-Basedto the results of unsupervised Networksfor clustering Clustering accuracy (ACC), normalized mutual information (NMI), and adjusted rand index (ARI). An AE is a neural network which is trained to reconstruct the input from the hidden layer. The main2. Autoencoder characteristic-Based of Ne antworks AE is for the Clustering encoder–decoder network, which is used to train for the representationAn AE code. is a neural The representation network which is code trained usually to reconstruct has a smallerthe input dimensionalityfrom the hidden layer. than The the input layer, canmain be characteristic considered asof thean AE compressed is the encoder feature–decoder representation network, which of the is originalused to trai variablesn for the and can be usedrepresentation for further data code. mining The representation tasks, e.g., dimensioncode usually reductionhas a small [er13 dimensionality,14], classification than the and input regression layer, can be considered as the compressed feature representation of the original variables and can models [15,16], and clustering analysis [8,12]. In addition, the feature representation from an AE can be used for further data mining tasks, e.g., dimension reduction [13,14], classification and regression presentmodels high-level [15,16] features, and clustering rather analysis than the [8,12] low-level. In addition, features the feature which representation are only derived from anby AEtraditional can methods,present e.g., high HOG-level [17 features] or SIFT rather [18 than], and the low may-level suff featureser from which appearance are only derived variations by tradition of scenesal and objects [methods,2]. With e.g., the HOG introduction [17] or SIFT of [18] the, conceptand may ofsuffer deep from learning, appearance traditional variations AEsof scenes were and extended as SAE byobjects adding [2]. With multiple the introduction layers, usually of the concept larger of than deep 2 learning, layers, totraditional form the AE deeps were structure extended ofas an AE. The detailedSAE by content adding ofmultiple SAE canlayers, be usually found large in [19r than,20]. 2 layers, to form the deep structure of an AE. The detailed content of SAE can be found in [19,20]. n Let a set of unlabeled training data be x1, x2, ::: , xn , where xi R n. Then, the structure of an AE Let a set of unlabeled training data bef {x , x ,..., x g }, where x 2 . Then, the structure of an can be depicted, as shown in Figure1. 12 n i AE can be depicted, as shown in Figure 1. FigureFigure 1. The1. The structure structure ofof an autoencoder autoencoder (AE (AE).). Algorithms 2019, 12, 122 3 of 11 The purpose of an AE is to learn representation by minimizing the construction loss (x, xˆ). L The weights between the input layer and hidden layer are called the encoder, and the weights between the hidden layer and output layer are called the decoder. The bottleneck code, also called code/latent representation, indicates the compressed knowledge representation of the original input. n 1 k 1 Hence, if we set the input vector x R × , the output vector of the encoder is h R × , and the n 1 2 2 output vector of the decoder is xˆ R × ; the representation code can be represented as 2 h = σh(ah) = σh(Wx + bh) (1) k 1 where σh( ) denotes the activation function of the hidden layer, ah R × is the summation of the · k n 2 k 1 hidden layer, W R × is the weight matrix between the input layer and hidden layer, and bh R × 2 2 denotes the bias vector of the hidden layer. The output of an AE is calculated as xˆ = σo(ao) =σo(Whˆ + bo) (2) n 1 where σo( ) denotes the activation function of the output layer, ao R × is the summation of the · n k 2 n 1 output layer, Wˆ R × is the weight matrix from the hidden layer to output layer, and bo R × is the 2 2 bias vector of the output layer.

Load more