
Journal of ICT, 17, No. 2 (April) 2018, pp: 233–269

How to cite this paper:

Roy, S. S., Ahmed, M., & Akhand, M. A. H. (2018). Noisy image classification using hybrid deep learning methods. Journal of Information and Communication Technology, 17 (2), 233–269.

NOISY IMAGE CLASSIFICATION USING HYBRID DEEP LEARNING METHODS

1Sudipta Singha Roy, 2Mahtab Ahmed & 2Muhammad Aminul Haque Akhand
1Institute of Information and Communication Technology, Khulna University of Engineering & Technology, Khulna, Bangladesh
2Dept. of Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh

[email protected]; [email protected]; [email protected]

ABSTRACT

In real-world scenarios, image classification models degrade in performance as the images are corrupted with noise, while these models are trained with preprocessed data. Although deep neural networks (DNNs) are found efficient for image classification due to their deep layer-wise design that emulates latent features from data, they suffer from the same noise issue. Noise in images is a common phenomenon in real-life scenarios, and a number of studies have been conducted over the previous couple of decades with the intention of overcoming the effect of noise in image data. The aim of this study was to investigate a better DNN-based noisy image classification system. At first, autoencoder (AE)-based denoising techniques were considered to reconstruct the native image from the input noisy image. Then, a convolutional neural network (CNN) was employed to classify the reconstructed image, as the CNN is a prominent DNN method with the ability to preserve a better representation of the internal structure of image data. In the denoising step, a variety of existing AEs, namely the denoising autoencoder (DAE), the convolutional denoising autoencoder (CDAE) and the denoising variational autoencoder

Received: 17 October 2017 Accepted: 10 February 2018


(DVAE), as well as two hybrid AEs (DAE-CDAE and DVAE-CDAE), were used. Therefore, this study considered five hybrid models for noisy image classification, termed DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN. The proposed hybrid classifiers were validated by experimenting over two benchmark datasets (i.e. MNIST and CIFAR-10) after corrupting them with noise of various proportions. These methods outperformed some existing eminent methods, attaining satisfactory recognition accuracy even when the images were corrupted with 50% noise, although the models were trained with only 20% noise in the images. Among the proposed methods, DVAE-CDAE-CNN was found to be better than the others when classifying massively noisy images, and DVAE-CNN was the most appropriate for regular noise. The main significance of this work is the employment of hybrid models with the complementary strengths of AEs and CNN in noisy image classification. The AEs in the hybrid models enhanced the proficiency of the CNN to classify highly noisy data even though it was trained with low-level noise.

Keywords: Image denoising, CNN, denoising autoencoder, convolutional denoising autoencoder, denoising variational autoencoder, hybrid architecture.

INTRODUCTION

In recent years, deep learning approaches have been extensively studied for image classification and image processing tasks such as perceiving the underlying knowledge from images. Deep neural networks (DNNs) utilize their deep layer-wise design to emulate latent features from data and thus gain the ability to appropriately classify patterns. Arigbabu et al. (2017) combined Laplacian filters over images with the Pyramid Histogram of Gradient (PHOG) shape descriptor (Bosch et al., 2007) to extract face shape descriptions. Later, they used the Support Vector Machine (SVM) (Cortes & Vapnik, 1995) for face recognition tasks. One progressive feature-extracting variant of DNNs, the convolutional neural network (CNN) (LeCun et al., 1998; Krizhevsky et al., 2012; Schmidhuber, 2015), has surpassed the vast majority of image classification methods. Different research outcomes boldly indicate that feature selection by deep learning with CNN should be the primary candidate in most image recognition tasks (Sharif et al., 2014). The convolution and the following pooling (Scherer et al., 2010)

layers preserve the corresponding locations of features and along these lines empower the CNN to preserve a better epitome of the input data. Current CNN works are concentrated on computer vision issues, for example 3D object recognition, traffic sign and natural image classification (Huang and LeCun, 2006; Cireşan et al., 2011a; Cireşan et al., 2011b), image segmentation (Turaga et al., 2010), face detection (Matsugu et al., 2003), chest pathology identification (Bar et al., 2015), Magnetic Resonance Image (MRI) segmentation (Bezdek et al., 1993) and so on. However, the performance of deep CNNs highly depends on a tremendous amount of pre-processed labeled data. Simonyan (2013) proposed an improved variant of the Fisher vector image encoding method and combined it with a CNN to develop a hybrid architecture that can classify images at a comparatively smaller computational cost than traditional models, as well as assess the performance of the image classification pipeline with increased depth in layers.

Some variants of deep models, namely unsupervised deep networks, learn the underlying representation from input images, overcoming the necessity for the input data to be labeled. One traditional model of this type is the stacked autoencoder (SAE) (Bourlard and Kamp, 1988; Bengio, 2009; Rumelhart, 1985), whose basic architecture holds a stack of shallow encoders that learn features from the data by encoding the input into a vector and then decoding this vector back to its native representation. Shin et al. (2013) applied stacked sparse autoencoders (SSAEs) to a medical image classification task and achieved a notable improvement in classification accuracy. Norouzi et al. (2009) introduced the stacked convolutional restricted Boltzmann machine (SCRBM), which incorporates dimensional locality and also weight sharing by maintaining a stack of convolutional restricted Boltzmann machines (CRBMs) to build deep models. Lee et al. (2009) introduced the convolutional deep belief network (CDBN), which, unlike the deep belief network (DBN), places a CRBM in each layer instead of an RBM and utilizes a convolutional structure to join the layers, thus building hierarchical models. Contrasted with the conventional DBN, it preserves spatial locality and enhances the performance of feature representation (Hinton et al., 2006). With comparable ideas, Zeiler et al. (2010, 2011) proposed a deconvolutional deep model in view of the conventional sparse coding technique (Olshausen and Field, 1997). The deconvolution operation depends on the convolutional decomposition of the input under a sparsity constraint. It is a modification of the traditional sparse coding methods; contrasted with sparse coding, it can learn better feature representations.


Data subjected to noise is a hindrance to the success of deep network-based image recognition systems in real-world applications. In most real-life scenarios, digital images are adulterated with noise during transmission and acquisition, degrading the performance of image classification, medical image diagnosis, etc. One major issue originating from an intrinsic attribute of a DNN is its sensitivity to the input data. Being sensitive to little perturbations, DNNs may be misled into misclassifying an image carrying a certain amount of imperceptible perturbation (Szegedy et al., 2013). As a result, when there is noise present in the input data, the features learned by the DNN may not be robust. As examples, medical imaging techniques which are vulnerable to noise, such as MRI, X-ray and Computed Tomography (CT), can be considered (Sanches et al., 2008). The reasons vary from the use of various image acquisition systems to attempts at diminishing patients' exposure to radiation; as the amount of radiation is diminished, the adulteration of the images with noise increases (Gondara, 2016; Agostinelli et al., 2013). A survey conducted by Lu and Weng (2007) investigated image classification methods and suggested that image denoising prior to classification is efficient in the case of remotely sensed data in a thematic map such as a geographical information system (GIS). Even if the classifier is trained with noisy data, it does not show a much better performance in image classification. So, image denoising has become a compulsory requirement prior to feeding the image to the classifier in order to achieve a better classification result.

A notable number of studies have been directed at image denoising over the previous couple of years to make deep learning-based image classification systems more compatible with practical applications. In the past, research in this field was conducted where denoising was accomplished on the premise of transformation techniques (Coifman and Donoho, 1995), partial differential equation-based methods (Perona and Malik, 1990; Rudin and Osher, 1994; Subakan et al., 2007), and in addition distributed sparse coding approaches (Elad and Aharon, 2006; Olshausen and Field, 1997; Mairal et al., 2009). Singh et al. (2014) proposed an efficient classification model for multi-class object images subject to noise. They applied wavelet transform-based image denoising techniques by employing NeighShrink thresholding over the wavelet coefficients to eliminate the wavelet coefficients causing noise in the image and picking up only the useful ones.


Recent studies have effectively utilized deep learning-based approaches with the intention of accomplishing image denoising (Krizhevsky et al., 2012; Bengio et al., 2007; Glorot et al., 2011). Burger et al. (2012) demonstrated that performance similar to the previously described strategies can be accomplished by applying a plain multi-layer perceptron (MLP). Jain et al. (2009) employed a CNN to denoise images, which performed well notwithstanding utilizing a smaller set of training images. An assortment of autoencoders (AEs) has been employed to denoise images, and these techniques have definitely surpassed the conventional denoising methods as they are less restrictive about the details of the noise generative mechanisms (Cho, 2013; Vincent et al., 2008; Vincent et al., 2010). Vincent et al. (2008) introduced the denoising autoencoder (DAE), which learns to recreate native images from adulterated forms by injecting arbitrary noise into the images of the training set during the learning period. These DAEs are stacked to develop a deep unsupervised learning network called the stacked DAE (SDAE) for learning deep representations (Vincent et al., 2010). Xie et al. (2012) deployed a combination of sparse coding along with DAEs for image denoising and blind inpainting tasks. It was designed to work with images subject to white Gaussian noise and superimposed text. Cho (2013) employed Boltzmann machines as well as SDAEs for image denoising tasks in the case of high levels of noise injected into the images. He employed three distinct depth settings (one, two and four layers) for both the SDAEs and the Boltzmann machines to evaluate the performance of noise removal. Agostinelli et al. (2013) introduced the adaptive multi-column DNN, a combination of multiple stacked sparse DAEs (SSDAEs), that can denoise various types of noise in images in a standalone manner. They computed optimal column weights using a nonlinear optimization program and later trained the individual networks to anticipate the optimal weights. One common disadvantage of these DAE-based models is that they learn the underlying hierarchical features from the image by reshaping the high-dimensional data into vectors and thus discard the intrinsic structures of the images.

With the intention of solving this problem, Masci et al. (2011) proposed another variant of the autoencoder called the convolutional AE (CAE), which trains itself to reconstruct images from the input image data in a convolutional manner. The stacked CAE forces the adjacent CAEs to learn the innate structure of the input image throughout the series of convolution and pooling operations. The kernels and other learning parameters of each layer are updated by backpropagation to convolve the feature maps of the input images into more abstract features at each layer. Compared to the previously specified AEs, it has

proved its capability to preserve more relevant structural information. Xu et al. (2014) developed a deep CNN that can figure out the characteristics of blur degradation in an image. Gondara (2016) employed DAEs constructed with convolutional layers for denoising medical images. Du et al. (2017) proposed stacked convolutional denoising autoencoders (SCDAE) by stacking DAEs in a convolutional way, where each layer produces high-dimensional feature maps by convolving the features of the previous layer trained by a DAE. Recently, Kingma and Welling (2014) introduced the variational autoencoder (VAE), a hybrid of a deep learning model along with variational inference that has prompted remarkable advances in generative modelling. The loss function used for training the VAE is calculated by a variational upper bound on the log-likelihood of the data. It can figure out and preserve shape variability beyond the image set as well as reconstruct images given the manifold coordinates. Unlike other deterministic models, it is a probabilistic generative model which is trained throughout with stochastic gradient descent. Unlike the DAE, which corrupts the input images by adding noise at the input level and later learns to reconstruct the clear image, the VAE learns with noise added in its stochastic hidden layer. Im et al. (2017) proposed that adding noise not only in the stochastic hidden layer but also in the input layer is beneficial and empowers the VAE to perform image denoising tasks. They proposed a modified training criterion for denoising variational autoencoders (DVAE) that resembles a tractable bound in case the input image is adulterated with noise.

The intention of this work was to build a few supervised image classifiers that can demonstrate better classification results across a noisy image set, thereby contemplating the DAE, CDAE and DVAE and proposing some hybrid models utilizing a CNN along with these AEs. Initially, a DAE, a CDAE and a DVAE were trained with image data subject to a lower, regular noise level so that they could omit noise from the input images and reconstruct a native form of them. To counter massively noisy images, two hybrid structures (i.e. DAE-CDAE and DVAE-CDAE) were further investigated, where for each of them two AEs were deployed in a cascaded manner. The reconstructed images from these AEs were fed to a following CNN for classification, where the CNN is trained with raw images having zero percent noise injected into them. The classification performance of this CNN is solely dependent on the quality of the reconstructed images from the conventional as well as the hybrid AE structures. The DAE-CDAE-CNN and DVAE-CDAE-CNN models can work better with massively noisy images because of their cascaded architectures and thus omit the requirement of training with images corrupted by noise of different levels.


HYBRID DEEP LEARNING-BASED NOISY IMAGE CLASSIFICATION

Real-world image classification tasks suffer from noise and other imperfections existing in the image data. So, denoising images prior to classification is compulsory. Noisy image classification tasks incorporate two steps, i.e. image denoising and image classification. This section first explains some conventional models for image denoising based on AEs as well as image classification with a CNN. Then it presents the proposed hybrid methods consisting of different cascaded AEs plus a CNN.

CONVENTIONAL METHODS FOR IMAGE DENOISING AND CLASSIFICATION

Convolutional Neural Network (CNN) as Image Classifier

CNNs (LeCun et al., 1998), which are multiple-layered variants of the artificial neural network (ANN), are well suited to classifying images and perceiving visual patterns straightforwardly from images. In a CNN architecture, the information propagation throughout its multiple layers allows it to extract features from the perceived data at each layer by means of applying digital filtering techniques. CNNs perform on the basis of two main processes: convolution and subsampling. During the convolution process, a small-sized kernel of size $K_h \times K_w$ is applied over an input feature map (IFM) and produces a convolved feature map (CFM). The first set of CFMs is produced by applying the convolution operation over the original input image. Here, a kernel is only an arrangement of weights and a bias. Every particular point in the CFM is gained by applying the same kernel over every small portion of the IFM, called a local receptive field (LRF). In this way, weights are shared among all positions throughout the convolutional process and spatial locality is preserved. The CFM computed from an IFM would be:

$$\mathrm{CFM}(x, y) = f\left(\sum_{i=1}^{K_h}\sum_{j=1}^{K_w} K(i, j) * \mathrm{IFM}(x+i,\, y+j) + \mathfrak{B}\right) \tag{1}$$

where $\mathfrak{B}$ and $f$ represent the bias of the kernel and the activation function respectively, whereas the 2-D convolution is symbolized by $*$. Throughout all the experiments here, the scaled sigmoid activation function as well as a single bias is used for every latent map. While particular kernels may create distinct CFMs from the same IFM, the outputs of numerous kernels are combined to deliver CFMs for different IFMs.
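To make Eq. (1) concrete, the following minimal NumPy sketch (illustrative only, not the authors' code; function and variable names are ours) computes one CFM from a single-channel IFM using one shared kernel and a single bias:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def convolve_feature_map(ifm, kernel, bias, activation=sigmoid):
    """Eq. (1): slide one shared K_h x K_w kernel (plus a single bias)
    over every local receptive field (LRF) of the input feature map."""
    kh, kw = kernel.shape
    out_h, out_w = ifm.shape[0] - kh + 1, ifm.shape[1] - kw + 1
    cfm = np.empty((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            lrf = ifm[x:x + kh, y:y + kw]  # local receptive field
            cfm[x, y] = activation(np.sum(kernel * lrf) + bias)
    return cfm

# Example: a 28x28 input and a 5x5 kernel give a 24x24 CFM.
cfm = convolve_feature_map(np.random.rand(28, 28), np.random.randn(5, 5), 0.1)
print(cfm.shape)  # (24, 24)
```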


In a CNN, each convolutional layer is followed by a subsampling layer to simplify the feature map gained from the convolution operation. This simplification process is done by selecting significant features from a region and discarding the rest (Du et al., 2017). Among various subsampling methods, max-pooling (Scherer et al., 2010) was used throughout our experiments. It takes the maximum value over non-overlapping sub-locales and can be defined as:

$$\mathrm{SFM}(x, y) = d\left(\sum_{i=0}^{R-1}\sum_{j=0}^{C-1} \mathrm{CFM}(xR-1+i,\, yC-1+j)\right) \tag{2}$$

where $R$ and $C$ denote the size of the pooling area as an $R \times C$ matrix and $d$ denotes the subsampling operation on the pooling area. The size of the SFM becomes half of that of the CFM if $R \times C$ is $2 \times 2$. In max-pooling, each point in the SFM is the maximum value computed from a particular $2 \times 2$ locale of the CFM (Akhand et al., 2016, 2017).
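A minimal sketch of Eq. (2) for the max-pooling case, assuming non-overlapping 2 × 2 regions (names are illustrative, not from the paper):

```python
import numpy as np

def max_pool(cfm, r=2, c=2):
    """Eq. (2) with d(.) as the max operation: each SFM point is the
    maximum over one non-overlapping R x C locale, so a 2 x 2 pooling
    area halves each dimension of the CFM."""
    h, w = cfm.shape[0] // r, cfm.shape[1] // c
    blocks = cfm[:h * r, :w * c].reshape(h, r, w, c)  # split into R x C tiles
    return blocks.max(axis=(1, 3))

print(max_pool(np.random.rand(24, 24)).shape)  # (12, 12)
```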

In a CNN, the series of convolution-subsampling operations is followed by a hidden layer and then an output layer sequentially. The nodes of the hidden layer and the output layer are fully connected, and the terminal SFM values are linearized to serve as the hidden layer. The error in the classification task can be measured from:

$$\mathbb{E}(d_o, y_o) = \frac{1}{2n}\sum_{i=1}^{n}\big(d_o(\mathcal{P}) - y_o(\mathcal{P})\big)^2 \tag{3}$$

where $n$ is the product of the total number of patterns and the total number of output nodes in that particular classification task, $\mathcal{P}$ is a particular pattern, and $d_o$ and $y_o$ denote the desired output and the obtained output respectively. The learning parameters are updated during backpropagation. Throughout our experiments, back-propagation (BP) (Liu et al., 2015; Bouvrie, 2006) was used for training the CNN. The CNN applied in our experiments is demonstrated in Fig. 1. It consists of two convolutional layers (conv1 and conv2) and two subsampling layers (sub1 and sub2), each following a single convolutional layer. Throughout the experiments, the CNN used here was trained with noiseless raw images.

Figure 1. CNN architecture for classification.

Denoising Autoencoder (DAE)

The DAE expands the conventional autoencoder with some stochastic augmentations in order to attain the ability to reproduce the native image from its noisy form (Vincent et al., 2008). This noise is usually injected deliberately from a chosen distribution. The architecture of the DAE is demonstrated in Fig. 2.

For a given input $x \in [0,1]^{d}$, where $d$ is the dimensionality, the DAE adulterates $x$ into $\tilde{x} \in [0,1]^{d}$ with some random noise, added with a certain probability $\wp$ using a stochastic mapping:

$$\tilde{x} = \mathfrak{D}(\tilde{x} \mid x, \wp) \tag{3}$$

The type of distribution is regulated by the distribution of the original input $x$ and the kind of arbitrary noise added to it. In practical cases, binomial noise is used for black-and-white images, whereas for color images uncorrelated Gaussian noise is better suited. The zero-masking (binomial) noise as well as Gaussian noise were applied throughout the experiments here. Then, $\tilde{x}$ was mapped to an underlying hidden representation $y$ by means of a nonlinear deterministic function:

$$y = \mathcal{F}(\mathcal{M}_1\tilde{x} + \mathcal{B}_1) \tag{4}$$

In the very same way as in the traditional autoencoder, this hidden representation is then mapped to the reconstructed feature $z \in [0,1]^{d}$ of the original input $x$ by applying another nonlinear deterministic function:

$$z = g(\mathcal{M}_2 y + \mathcal{B}_2) \tag{5}$$

The construction error was assessed by computing the squared error $\Delta$ between the input $x$ and the reconstructed feature representation $z$. This is defined as:


$$\Delta(x, z) = \frac{1}{2n}\sum_{i=1}^{n}(x_i - z_i)^2 \tag{6}$$

The main aim of this reconstruction process is to minimize the construction error, and this is done by optimizing the model parameters in such a way that:

$$(\mathcal{M}_1, \mathcal{M}_2, \mathcal{B}_1, \mathcal{B}_2)_{opt} = \arg\min_{\mathcal{M}_1, \mathcal{M}_2, \mathcal{B}_1, \mathcal{B}_2} \Delta(x, z) \tag{7}$$

For our experiment, the DAE was trained with images corrupted by 20% noise.

Figure 2. DAE architecture for image denoising.
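As a hedged illustration of Eqs. (3)–(6), the NumPy sketch below implements zero-masking corruption, a DAE forward pass, and the reconstruction error; the tanh/sigmoid choices and the layer sizes are our assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, p=0.2):
    """Eq. (3): zero-masking (binomial) corruption -- each pixel of x is
    dropped to 0 with probability p (20%, as used for training here)."""
    return x * (rng.random(x.shape) >= p)

def dae_forward(x_tilde, M1, b1, M2, b2):
    """Eqs. (4)-(5): nonlinear encode, then nonlinear decode."""
    y = np.tanh(M1 @ x_tilde + b1)            # hidden representation
    z = 1.0 / (1.0 + np.exp(-(M2 @ y + b2)))  # reconstruction in [0,1]
    return z

def reconstruction_error(x, z):
    """Eq. (6): squared error between the clean input and reconstruction."""
    return np.sum((x - z) ** 2) / (2 * x.size)

# Toy usage on one flattened 28x28 image; weight shapes are assumed.
x = rng.random(784)
M1, b1 = rng.normal(0, 0.01, (256, 784)), np.zeros(256)
M2, b2 = rng.normal(0, 0.01, (784, 256)), np.zeros(784)
print(reconstruction_error(x, dae_forward(corrupt(x), M1, b1, M2, b2)))
```

Training per Eq. (7) would then adjust the weights and biases by gradient descent to minimize this error.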

Convolutional Denoising Autoencoder (CDAE)

The fundamental contrast between the CDAE (Masci et al., 2011) and conventional autoencoders is that the CDAE shares weights among all positions in the input and consequently conserves spatial locality. Subsequently, the reconstruction process is finished by a linear combination of all important image patches on the premise of the latent code. For a single-channel input $x$, the latent representation of the $k$-th feature map would be:

$$h^k = \varrho(x * \omega^k + \mathcal{B}^k) \tag{8}$$

where $\mathcal{B}^k$ denotes the bias, $\varrho$ represents the activation function and the 2-D convolution is symbolized by $*$. The scaled hyperbolic tangent activation function and a single bias were used for every latent map during the experiments. The reconstruction was achieved by applying:

$$y = \varrho\left(\sum_{k \in H} h^k * \tilde{\omega}^k + \beta\right) \tag{9}$$

As in the previous step, one bias was also used here for every input channel. $H$ denotes the group of underlying feature maps, and $\tilde{\omega}$ identifies the flip operation over both dimensions of the weights $\omega$. The error function used here is defined as:

$$\varepsilon(x, y) = \frac{1}{2n}\sum_{i=1}^{n}(x_i - y_i)^2 \tag{10}$$

The gradient of this error function is computed during the backpropagation (Liu et al., 2015; Bouvrie, 2006) step. The overall architectural description is illustrated in Fig. 3. The convolution operation employed here is uniform with the convolution operation depicted in the CNN section. During the training period, the native image was utilized as the output label in order to update the kernel weights and other parameters so that at testing time the CDAE could reproduce a noise-omitted picture given a noise-injected one. In this experiment, the CDAE was trained with 20% noisy images.

Figure 3. CDAE architecture for image denoising.
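The following sketch illustrates Eqs. (8) and (9) with SciPy's 2-D convolution; plain tanh stands in for the scaled hyperbolic tangent, and the kernel count and sizes are assumed for illustration:

```python
import numpy as np
from scipy.signal import convolve2d

def cdae_forward(x, kernels, biases, beta):
    """Eqs. (8)-(9): each latent map h^k convolves the input with kernel
    w^k (valid mode); the reconstruction sums every h^k convolved with
    the kernel flipped over both dimensions (full mode) plus one bias."""
    latents = [np.tanh(convolve2d(x, w, mode="valid") + b)  # Eq. (8)
               for w, b in zip(kernels, biases)]
    recon = sum(convolve2d(h, np.flip(w), mode="full")
                for h, w in zip(latents, kernels))
    return np.tanh(recon + beta)                            # Eq. (9)

rng = np.random.default_rng(1)
kernels = [rng.normal(0, 0.1, (5, 5)) for _ in range(4)]
x_hat = cdae_forward(rng.random((28, 28)), kernels, [0.0] * 4, beta=0.0)
print(x_hat.shape)  # (28, 28): 'valid' encoding then 'full' decoding restores size
```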
The structure of the DVAE used all through There are basically three components2 as the building block2 of a DVAE: an this experiment𝓆𝓆𝜙𝜙 𝑦𝑦 is𝑥𝑥 demon𝒩𝒩strated𝜇𝜇𝜙𝜙 𝑥𝑥 in 𝑒𝑒Fig.𝑒𝑒𝑒𝑒𝜙𝜙 𝜎𝜎4.𝜙𝜙 Both𝑥𝑥 the encoder𝜙𝜙 and the𝜙𝜙 decoder can be any variant of the encoder, the following decoder 𝓆𝓆 and 𝑦𝑦 𝑥𝑥 finally 𝒩𝒩 𝜇𝜇a( loss|𝑥𝑥) function.=𝑒𝑒𝑒𝑒𝑒𝑒 𝜎𝜎( 𝑥𝑥(The), structure ( ( of)) ) (12) neural network. It computes probability distribution and thus finds out the probability distributionthe DVAE of data used allby employingthrough this the experiment following equation is demonstrated: in Fig. 4. Both2 the log ( | ) 𝜙𝜙 𝜙𝜙 𝜙𝜙 encoder and the decoder can be any variantlog 𝓆𝓆of the𝑦𝑦( 𝑥𝑥neural| ) 𝒩𝒩 network.𝜇𝜇 𝑥𝑥 It𝑒𝑒 𝑒𝑒𝑒𝑒computes𝜎𝜎 𝑥𝑥 probability distribution( ) = ( , ) and= thus finds( | out) the( ) probability distribution (11) 𝓅𝓅𝜃𝜃 𝑥𝑥 𝑦𝑦 𝜃𝜃 of data x by employing the following equation:𝓅𝓅 𝑥𝑥 𝑦𝑦 log ( | ) 𝓅𝓅𝜃𝜃 𝑥𝑥 ∫ 𝓅𝓅𝜃𝜃 𝑥𝑥 𝑦𝑦 𝑑𝑑𝑑𝑑 ∫ 𝓅𝓅𝜃𝜃 𝑥𝑥 𝑦𝑦 𝓅𝓅 𝑦𝑦 𝑑𝑑𝑑𝑑 9 ( , ) = ~ ( | )[( , ) =( | )] + ( [ ( | )(|| | ()])+) 𝜃𝜃 ( ( | ) |(|13)( )) (13) ~ ( | 243) 𝓅𝓅 𝑥𝑥 𝑦𝑦 𝑖𝑖 𝑦𝑦 𝓆𝓆𝜙𝜙 𝑦𝑦 𝑥𝑥𝑖𝑖 𝜃𝜃 𝑖𝑖 (𝜙𝜙 | )𝑖𝑖 𝜙𝜙 𝑖𝑖 𝜃𝜃 ℓ 𝜙𝜙 𝜃𝜃 −𝐸𝐸 ℓ𝑖𝑖 𝑙𝑙𝑙𝑙𝜙𝜙𝑙𝑙𝜃𝜃𝓅𝓅 𝑥𝑥−𝑦𝑦𝐸𝐸𝑦𝑦 𝓆𝓆 𝐾𝐾𝐾𝐾𝑦𝑦 𝑥𝑥𝓆𝓆 𝑙𝑙𝑙𝑙𝑦𝑦𝑙𝑙 𝓅𝓅𝑥𝑥 𝜃𝜃 𝑥𝑥𝓅𝓅𝑖𝑖 𝑦𝑦 𝑦𝑦 𝐾𝐾𝐾𝐾 𝓆𝓆𝜙𝜙 𝑦𝑦 𝑥𝑥𝑖𝑖 𝓅𝓅𝜃𝜃 𝑦𝑦 ( , ) = ~ ( | )[ ( | )] + ( ( | )|| ( )) (13) 𝓆𝓆𝜙𝜙 𝑦𝑦 𝑥𝑥 𝑖𝑖 𝑦𝑦 𝓆𝓆𝜙𝜙 𝑦𝑦 𝑥𝑥𝑖𝑖 𝜃𝜃 𝑖𝑖 𝜙𝜙 𝑖𝑖 𝜃𝜃 ℓ 𝜙𝜙 𝜃𝜃 −𝐸𝐸 𝑙𝑙𝑙𝑙𝑙𝑙 𝓅𝓅 𝑥𝑥 𝑦𝑦 𝐾𝐾𝐾𝐾 𝓆𝓆 𝑦𝑦 𝑥𝑥 𝓅𝓅 𝑦𝑦 = 𝑁𝑁 ( , ) = ( , ) ( 14 ) (14) ( | ) = ( ( ), ( 𝑁𝑁( ))) (12) 𝐿𝐿 ∑ ℓ𝑖𝑖 𝜙𝜙 𝜃𝜃 2 𝑖𝑖 𝜙𝜙 𝜙𝜙 𝐿𝐿 ∑ 𝜙𝜙 ℓ 𝜙𝜙 𝜃𝜃= ( , ) (14) 𝓆𝓆 𝑦𝑦𝑖𝑖=𝑥𝑥1 𝒩𝒩 𝜇𝜇 𝑥𝑥 𝑒𝑒𝑒𝑒𝑒𝑒 𝜎𝜎𝑖𝑖=1 𝑥𝑥 𝑁𝑁 𝐿𝐿 ∑ ℓ𝑖𝑖 𝜙𝜙 𝜃𝜃 log ( | ) 𝑖𝑖=1


Denoising Variational Autoencoder (DVAE)

The denoising variational autoencoder (DVAE) (Ciresan et al., 2011c; Kingma and Welling, 2013), a modern variant of the AE, is a deep directed graphical model that interprets the output of the encoder by means of variational inference. There are basically three components as the building blocks of a DVAE: an encoder, the following decoder and finally a loss function. The structure of the DVAE used throughout this experiment is demonstrated in Fig. 4. Both the encoder and the decoder can be any variant of neural network. The model finds the probability distribution of the data $x$ by employing the following equation:

$$p_\theta(x) = \int p_\theta(x, y)\,dy = \int p_\theta(x \mid y)\,p(y)\,dy \tag{11}$$

where $\theta$ represents the weights and biases of the decoder, $p(y)$ is the probability distribution of the latent variable $y$, which is often the standard normal distribution $\mathcal{N}(0, I)$, and $p_\theta(x \mid y)$ is the decoder's output, the probability distribution of the reconstructed data given the latent features.

The encoder neural network takes a data point $x$ as input and translates it to a hidden representation $y$, which has significantly less dimension than $x$. As the encoder learns to compress the data into this significantly lower-dimensional space, it produces the parameters of a Gaussian probability density $q_\phi(y \mid x)$, where $\phi$ represents the weights and biases of the encoder. This posterior $q_\phi(y \mid x)$ is the uncorrelated multivariate normal determined by the encoder:

$$q_\phi(y \mid x) = \mathcal{N}\!\left(\mu_\phi(x), \exp\!\left(\sigma_\phi^2(x)\right)\right) \tag{12}$$

where $\mathcal{N}$ represents the normal distribution, and $\mu$ and $\sigma$ denote the mean and the standard deviation respectively. The decoder neural network takes the latent feature representation $y$ as input, and its outputs are the parameters of the probability distribution of the data $p_\theta(x \mid y)$. As the decoder tries to reconstruct the real-valued numbers in $x$ of higher dimensionality from the real-valued numbers in $y$ of lower dimensionality, some information may be lost. This reconstruction loss is calculated using the log-likelihood $\log p_\theta(x \mid y)$.

Unlike other conventional autoencoders, the loss function used in the DVAE is the negative log-likelihood with an additional regularizer. As all the data points do not share a global representation, the loss function decomposes into terms that each rely on a single data point. The loss function for a single data point $x_i$ is computed by:

$$\ell_i(\phi, \theta) = -\mathbb{E}_{y \sim q_\phi(y \mid x_i)}\!\left[\log p_\theta(x_i \mid y)\right] + KL\!\left(q_\phi(y \mid x_i)\,\|\,p_\theta(y)\right) \tag{13}$$

Thus, for $N$ total data points the overall loss would be:

$$L = \sum_{i=1}^{N} \ell_i(\phi, \theta) \tag{14}$$

This DVAE is trained to reconstruct native images from their 20% noisy form.

Figure 4. DVAE architecture for image denoising.
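A minimal sketch of the per-datapoint loss in Eq. (13), assuming a diagonal-Gaussian posterior, a standard-normal prior, Bernoulli pixel likelihoods and placeholder encoder/decoder networks (all of which are our assumptions for illustration, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(2)

def dvae_loss_single(x_i, encode, decode, p=0.2):
    """Eq. (13) for one data point: corrupt the input, sample the latent
    via the reparameterization trick, then add the Bernoulli negative
    log-likelihood and the closed-form Gaussian KL regularizer."""
    x_tilde = x_i * (rng.random(x_i.shape) >= p)  # noisy input (DVAE)
    mu, log_var = encode(x_tilde)                 # q(y|x) parameters
    y = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    x_hat = decode(y)                             # Bernoulli pixel means
    nll = -np.sum(x_i * np.log(x_hat) + (1 - x_i) * np.log(1 - x_hat))
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return nll + kl  # summing this over all N points gives Eq. (14)

# Toy usage with linear stand-ins for the encoder/decoder networks.
W_e, W_d = rng.normal(0, 0.01, (32, 784)), rng.normal(0, 0.01, (784, 16))
encode = lambda v: np.split(W_e @ v, 2)           # -> (mu, log_var)
decode = lambda y: 1.0 / (1.0 + np.exp(-(W_d @ y)))
print(dvae_loss_single(rng.random(784), encode, decode))
```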

Proposed Hybrid Models for Noisy Image Classification

This section explains the proposed hybrid models DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN for noisy image classification. The common feature of all these models is that a CNN is used as the classifier, which takes the denoised (i.e. reconstructed) image from the prior AE of a particular model. The conventional AE(s) of a model are trained individually with regular noise, and the CNN is trained with noise-free images. Finally, the AE(s) and the CNN are cascaded to form a particular hybrid model and no further training is performed. The following subsections explain the architectural description as well as the working procedure of each individual model.

Hybrid Model 1: DAE-CNN Architecture

The proposed DAE-CNN is a supervised deep network designed to perform image classification regardless of the possibility of the images being noisy. With layer-wise training, the whole architecture of the DAE-CNN is optimized.

Fig. 5 shows the all-inclusive architecture of the proposed DAE-CNN model. This model is a fusion of a DAE and a two-layered CNN. In the first place, the noisy image is refined by the DAE, and afterwards the reconstructed image is fed to the accompanying CNN. The DAE filters the noise from the input images via the reconstruction process. All the encoder and decoder parameters (the input-hidden and the hidden-output weights) are initialized with the weights of the DAE trained before (discussed in the DAE section). The following CNN is designed with two convolution-subsampling layers at first, a following dense layer and finally an output layer. All the parameters of the CNN (the hidden-output weights, local averaging parameters, and kernels) are set to the corresponding parameters used in the pre-trained CNN, as discussed in the CNN section. In the end, only via a forward pass, this architecture does the noisy image classification task.


Figure 5. DAE-CNN architecture for noisy image classification.

Hybrid Model 2: CDAE-CNN Architecture

CDAE-CNN is another supervised deep network used in this study for classifying noisy images, as shown in Fig. 6. It combines a CDAE followed by a two-layered CNN, in the very same manner as the DAE-CNN architecture incorporates a DAE as an image reconstructor and a CNN as a classifier. Serving as a filter as well as a reconstructor, the CDAE reconstructs noise-free images from the noisy versions fed to it and then passes them to the following CNN. The kernel weights, along with all the parameters of both the CDAE and the CNN used here, were initialized with the values of the corresponding parameters of the pre-trained CDAE and CNN (discussed in the CDAE and CNN sections); a sketch of such a reconstructor is given below.
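A compact sketch of such a convolutional denoising reconstructor is shown below, re-expressed by us in PyTorch (the paper's implementation was in Matlab); the 5x5 kernels and 2x2 averaging window follow the experimental setup reported later, while the channel width, activations and upsampling decoder are assumptions.

import torch.nn as nn

# Convolutional denoising autoencoder: encode with convolution plus 2x2
# local averaging, then decode back to a one-channel image.
cdae = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5, padding=2), nn.Sigmoid(),  # 5x5 kernels (8 channels assumed)
    nn.AvgPool2d(2),                              # 2x2 local-averaging subsampling
    nn.Upsample(scale_factor=2),                  # decoder: undo the subsampling (assumed)
    nn.Conv2d(8, 1, kernel_size=5, padding=2), nn.Sigmoid(),  # reconstructed image
)
# Trained separately to map a 20% noisy image to its clean original.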

Figure 6. CDAE-CNN architecture for noisy-image classification.

Hybrid Model 3: DVAE-CNN Architecture

The DVAE-CNN architecture incorporates one image reconstructor and a following classifier, like the DAE-CNN and CDAE-CNN architectures. In this model the DVAE serves as the noise filter as well as the image reconstructor. At first the noisy image is fed to the DVAE. Like the DAE and CDAE, it also reconstructs noise-free native images from the noisy input images, but in a variational inference manner. Moreover, it uses an additional regularizer along with the negative log-likelihood term that is common to all other traditional autoencoders. The following two-layered CNN takes this reconstructed and less noisy image as input and classifies it. The inclusive architecture is optimized via layer-wise training. Fig. 7 gives a proper demonstration of this model.



All the weights between the input and hidden layers, as well as between the hidden and output layers of the DVAE, are initialized with the corresponding weights of the pretrained DVAE (discussed in the DVAE section). After this, the classifier CNN is initialized, containing two convolution-subsampling layers, a dense layer and, in the end, an output layer. All the parameters of this CNN are initialized with the values of the very same parameters used in the pre-trained CNN (discussed in the CNN section). A simple forward pass would then employ the DVAE-CNN in the classification task. A compact sketch of the DVAE reconstructor is given below.
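The following PyTorch sketch (ours, not the authors' Matlab code) outlines such a DVAE reconstructor: an encoder producing the mean and log-variance of a latent code, the reparameterization step, and a decoder. The 784-500-2 sizes follow the MNIST setup reported later; the activation choices are assumptions.

import torch
import torch.nn as nn

class DVAE(nn.Module):
    def __init__(self, d_in=784, d_hid=500, d_lat=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, d_hid), nn.Tanh())
        self.mu = nn.Linear(d_hid, d_lat)       # mean of q(z | noisy x)
        self.logvar = nn.Linear(d_hid, d_lat)   # log-variance of q(z | noisy x)
        self.dec = nn.Sequential(nn.Linear(d_lat, d_hid), nn.Tanh(),
                                 nn.Linear(d_hid, d_in), nn.Sigmoid())

    def forward(self, x_noisy):                 # x_noisy: flattened noisy image
        h = self.enc(x_noisy)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar          # reconstruction + terms for the KL loss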


Hybrid Model 4: DAE-CDAE-CNN Architecture

The hybrid DAE-CDAE-CNN supervised image classifier incorporates both the denoising and convolutional approaches (DAE and CDAE) for filtering noisy images and reconstructing noise-free raw images from them.

Figure 7. DVAE-CNN architecture for noisy image classification.

The all-embracing structure of the DAE-CDAE-CNN is exhibited in Fig. 8. It has three basic components: first a DAE, then a CDAE, both serving as image reconstructors, and finally a two-layered CNN serving as an image classifier.




Figure 8. DAE-CDAE-CNN architecture for noisy image classification.

The fundamental point of this method is to enhance the accuracy of the image classification through better-quality reconstructions of the noisy images. The first image reconstructor DAE's input-hidden weights, as well as its hidden-output weights, are set to the values of the corresponding weights of the same pre-trained DAE as in the DAE-CNN architecture. The DAE tries to reconstruct the raw image, removing the noise from the noisy input image; serving as a filter, it outputs a reconstructed image with less noise. This reconstructed intermediate image is then fed to the CDAE for further denoising. Compared to the DAE, the CDAE yields a better reconstruction of images. As this CDAE is fed with less noisy images than the original input, it outputs a better intermediate representation of the image for the following classifier. The kernels and other parameters of this CDAE are set uniformly to those of the pre-trained CDAE discussed in the CDAE section. The two-layered CNN is likewise set uniformly to the CNN trained with zero-noise-added images for classification purposes (discussed in the CNN section).

Hybrid Model 5: DVAE-CDAE-CNN Architecture

DVAE-CDAE-CNN (shown in Fig. 9) works in the same manner as the DAE-CDAE-CNN architecture (discussed in the section on Hybrid Model 4) and contains two image reconstructors: at the initial point a DVAE, and then a CDAE. As the DVAE performs better image reconstruction than the traditional DAE (Im et al., 2017), the input image for the CDAE is better in quality here compared to the DAE-CDAE-CNN architecture. As a result, the hybrid image reconstructor DVAE-CDAE outputs better images for the following CNN classifier than in the DAE-CDAE-CNN architecture, resulting in better image classification in case the image is noisy. All the parameters of this DVAE are tuned to the corresponding parameter values of the pre-trained DVAE (as specified in the DVAE section). The kernels and hidden-output weights, along with the local averaging parameters used in this structure, are initialized with the corresponding parameter values of the previously trained CDAE and CNN (discussed in the CDAE and CNN sections).

Figure 9. DVAE-CDAE-CNN architecture for recreating native images from their noise-degraded configurations.

PERFORMANCE EVALUATION

This section investigates the performance of the proposed hybrid models on benchmark datasets of two different categories: MNIST numeral images and CIFAR-10 object images. This section first describes the datasets and the experimental setups used to work over these datasets. Experiments were conducted at different noise levels and the proficiency of the models was compared against existing models. These models were implemented in Matlab R2015a. The performance analysis was conducted on a MacBook Pro laptop (CPU: Intel Core i5 @ 2.70 GHz and RAM: 8.00 GB) in the OS X Yosemite environment.

Data Description

Image data corrupted with noise occur while dealing with real-life practical applications. Even when a well-established system is employed on real-life data, that system might fail only because of the inappropriateness of the data. Therefore, it is highly required to preprocess those image data prior to applying them in the practical application. With the intention to cope with this type of scenario, and at the same time to show the significance of the proposed models, we considered two benchmark datasets in this study: MNIST (LeCun et al., 2010) and CIFAR-10 (Coates et al., 2011). A large number of recent studies utilized these two datasets considering the image data as a source (LeCun et al., 1998; Vincent et al., 2008; Vincent et al., 2010; Masci et al., 2011).

MNIST Dataset: The dataset contains 70000 28x28-sized sample images with a large variety of distinct numeral images from various individuals rehearsing distinctive individual writing patterns. The images are divided into training and test sets. The test set holds 10000 images having 1000 samples for each of the 10 numerals, and the training set contains 60000 images having 6000 images for every individual digit. Fig. 10(a) displays a few sample images of every handwritten numeral.

(a) MNIST

(b) CIFAR-10

Figure 10. Samples of some images from MNIST and CIFAR-10 datasets.

CIFAR-10 Dataset: This dataset contains 60000 32x32-sized samples of colored images of ten different objects (airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck). The training set consists of 50000 distinct images having 5000 samples of every class. The remaining 10000 sample images are used as the test set. In the test set, each object has exactly 1000 sample images, which are mutually exclusive. A few images from each class are shown in Fig. 10(b). For the experiments, these images were converted to grey-scale.

Experimental Setup

We experimented with the proposed hybrid models over the MNIST and grey-scaled CIFAR-10 datasets. Images in these two datasets differ in size: MNIST contains images of size 28x28 whereas images in CIFAR-10 are of size 32x32, forcing us to apply a different architecture for the proposed models on each dataset. This section describes the actual architectural setup used to work with the MNIST and CIFAR-10 datasets.

A uniform experimental environment was set up for a fair investigation among the proposed and the existing methods. As the images from the datasets were of size 28×28 (MNIST) and 32×32 (CIFAR-10), each of these classifiers


had 784 (and 1024) input units so as to take the linearized version of the data. As the data were divided into 10 classes, each of these classifiers had 10 units in the output layer. The intermediate portion of each network varied based on its architecture. The DAE and DVAE had a hidden layer of size 500 (and 700). Additionally, the DVAE had an extra latent representation layer of size two. On the other hand, the CDAE had a kernel size of 5×5 and a subsampling window with a 2×2 local averaging area. Throughout the experiments, a two-layered CNN having two convolution-subsampling layers was used with all of the AEs (conventional and hybrid). For both convolutional layers the kernel size remained fixed at 5×5, and in both subsampling layers the size of the pooling area was 2×2; an illustrative sketch of this classifier follows.
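Re-expressed in PyTorch for illustration (the original implementation was in Matlab), the two-layered CNN used throughout would look roughly as follows for MNIST; the channel counts and the dense-layer width are placeholders, since the paper fixes only the kernel (5x5), pooling (2x2), input and output sizes.

import torch.nn as nn

cnn_mnist = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(2),                            # first convolution-subsampling stage
    nn.Conv2d(6, 12, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(2),                            # second convolution-subsampling stage
    nn.Flatten(),
    nn.Linear(12 * 7 * 7, 120), nn.Sigmoid(),   # dense layer (width 120 assumed)
    nn.Linear(120, 10),                         # 10-unit output layer
)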

Due to the large training set, batch-wise training was performed, and all of the experiments were conducted with a fixed batch size of 50. The weights of each network were updated once per batch of image patterns, and the batch size (BS), i.e. the number of patterns in a batch, was treated as a user-defined parameter chosen such that the total number of training patterns was completely divisible by the BS value. For the experiments, the learning rate (eta) was varied in the range of 0.1 to 1.0, as sketched below.
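A sketch of this batch-wise scheme follows (ours); train_step, standing for one gradient update of whichever network is being trained, is a hypothetical helper, and the per-epoch shuffling is an assumption the paper does not state.

import numpy as np

def train(train_x, train_y, train_step, bs=50, eta=1.0, epochs=400):
    n = train_x.shape[0]
    assert n % bs == 0                     # BS is chosen so it divides the training set
    for _ in range(epochs):
        order = np.random.permutation(n)   # assumed shuffling of the patterns
        for start in range(0, n, bs):
            idx = order[start:start + bs]
            train_step(train_x[idx], train_y[idx], eta)  # one weight update per batch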

EXPERIMENTAL RESULTS AND ANALYSIS

As these models were validated against two datasets, the experimental results and analysis are presented in two different subsections.

Results on MNIST Dataset

This section illustrates the performance of the proposed models over the MNIST dataset with noise of different proportions injected into it. Fig. 11 delineates the result of the noise removal step utilizing DAE, CDAE, DVAE, DAE-CDAE and DVAE-CDAE separately on 50% noisy image data; these reconstructed images were then fed to the following CNN classifier. Initially, the images in the dataset were pre-processed and had no additional noise added to them. In order to assess the performance of the proposed hybrid classifiers on noisy images, noise was added manually to the images in the dataset. Zero-masking noise was used for conducting the experiments: an arbitrary matrix of the same size as the training image data was initialized, with pixels arbitrarily turned OFF with a probability of 20% for both the training and test cases, and then 50% for the test case only (a sketch of this corruption step is given below). It can be clearly seen that the reconstructed images from DAE-CDAE and DVAE-CDAE were much better than the reconstructed ones from the

standalone AEs. Because the DVAE uses the variational bound on the log-likelihood of the data as its loss function rather than the plain reconstruction error used by the DAE, it produces a better representation of the reconstructed images than the DAE. However, the DVAE produced slightly blurry images, though it kept the shapes of the objects more accurate than the DAE. So, when a CDAE was used in a cascaded manner after the DVAE, this blurriness was also removed, resulting in the DVAE-CDAE architecture outputting better reconstructed images than DAE-CDAE on 50% noisy input images.
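The zero-masking corruption described above can be sketched as follows (our re-expression in Python; the function and parameter names are ours):

import numpy as np

def zero_mask(images, p, rng=None):
    # Turn each pixel OFF (force it to zero) independently with probability p,
    # e.g. p=0.2 for training and testing, p=0.5 for the harder test-only case.
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(images.shape) >= p
    return images * keep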

Figure 11. Samples of original images from the MNIST dataset with and without 50% noise, and their reconstructions using the different AEs (DAE, CDAE, DVAE, DAE-CDAE and DVAE-CDAE).

The test set classification performance of all five models proposed in this study, along with that of a simple CNN, for both scenarios, when the images were corrupted with 20% as well as 50% noise, is portrayed in

Fig. 12. The classification accuracy was noted for up to 400 iterations, and Fig. 12(a) gives evidence that the DVAE-CNN architecture surpasses all other architectures in terms of 20% noisy image classification, with 98.84% accuracy. CDAE-CNN confirms the second position with 98.69% accuracy. The accuracies recorded for the DAE-CNN, DAE-CDAE-CNN, DVAE-CDAE-CNN and simple CNN architectures are 98.01%, 97.40%, 97.43% and 97.76% respectively. In Fig. 12(b) it is clearly visible that whenever the same test set images are corrupted with 50% noise, the DVAE-CDAE-CNN architecture surpasses all other models, attaining 96.74% accuracy, whereas it showed one of the least accurate classification results in the previous case. The reason behind this contradictory scene is that if we compare 50% noisy images against the 20% noisy image dataset, a larger number of pixels are found to be forcefully turned ON/OFF for 50% noisy images. That is why, whenever the frontier DVAE trained to work with 20% noisy images is fed with 50% noisy images, it forces a portion of the pixels turned OFF by the zero-mask noise to get turned ON. As the following CDAE works with this intermediate image, it reconstructs the other affected pixels completely, making the classification task easier for the CNN. For the very same reasons, the DAE-CDAE-CNN architecture performs better than the DAE-CNN, CDAE-CNN and DVAE-CNN architectures and achieves a classification accuracy of 96.34%. The 50% noisy image recognition accuracies obtained by DAE-CNN, CDAE-CNN and DVAE-CNN are 95.01%, 94.22% and 95.63% respectively. When these 50% noisy data are fed to the CNN classifier without any additional denoising and reconstruction process, the performance shown by the simple CNN is the worst, attaining the lowest classification accuracy (85.15%) compared to the other models. Fig. 12(b) supports the fact that, whereas each of these hybrid supervised classifiers gave more than 95% accuracy with just 50 iterations, a simple CNN's accuracy was less than 85% at that time.

(a) 20% noise                                        (b) 50% noise

Figure 12. Test set recognition accuracy over the MNIST dataset with batch size 50 and learning rate 1.0 for different networks.

During the first few iterations, the test set accuracy was lower compared to later iterations. This was not unexpected, as these test samples were not seen by the hybrid networks during the training period. Still, the classification accuracy on the test image sets improved significantly within a small number of iterations (e.g. up to 100).

Table 1 details the classification of each individual class by all the proposed hybrid models for test set samples after a fixed 400 epochs with 20% as well as 50% noise.

Table 1

Classification Performance of the Hybrid Models in the Case of Individual Objects from the MNIST Dataset

                               Accurate classification (out of 1000 test samples of each class)
Noise   Models             0     1     2     3     4     5     6     7     8     9     Accuracy (%)
20%     DAE-CNN            995   992   983   984   983   957   983   983   994   978   98.01
        CDAE-CNN           997   994   989   986   984   976   988   987   995   977   98.69
        DVAE-CNN           999   990   990   976   940   978   989   988   984   980   98.84
        DAE-CDAE-CNN       997   991   981   965   944   938   985   980   983   979   97.40
        DVAE-CDAE-CNN      997   990   981   965   925   940   985   980   938   980   97.43
50%     DAE-CNN            981   961   957   938   947   929   948   965   956   959   95.01
        CDAE-CNN           979   955   951   928   958   917   941   960   958   928   94.22
        DVAE-CNN           982   965   960   944   958   937   956   968   966   948   95.63
        DAE-CDAE-CNN       987   971   970   955   968   944   962   975   985   954   96.34
        DVAE-CDAE-CNN      988   973   972   959   936   955   966   977   947   960   96.74


For DAE-CNN and CDAE-CNN, it is clearly noticeable from the table that they performed worst for the numeral "5"; out of 1000 test cases, they classified 957 and 976 cases accurately, respectively. The DVAE-CNN classifier also performed worst while classifying "5"; still, it performed better than all the other models. For numeral "0" it showed the best classification result: in 999 cases out of 1000 it classified "0" correctly. The DAE-CDAE-CNN and DVAE-CDAE-CNN architectures also misclassified the same digit, 62 and 60 times respectively. These confusions in a few handwritten numeral images are a result of the various handwriting styles of individuals; furthermore, the arbitrary noise injected into the images makes the patterns slightly resemble each other. In any case, the proposed models accomplished their best classification for "0", classifying it correctly in 995, 997, 999, 997 and 997 cases out of 1000 for DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN respectively. Among all of the models, DVAE-CNN accomplished a decent noisy image classification, misclassifying just 116 cases, whereas DAE-CNN, CDAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN misclassified 199, 131, 260 and 257 cases respectively. In the case of 50% noisy images, the worst case occurred for all the models while classifying the numeral "5"; DAE-CNN misclassified it in 71 cases throughout the experiments. The accuracies achieved by the CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN architectures in classifying "5" are 91.7%, 93.7%, 94.4% and 95.5% respectively. The accuracies calculated in the case of classifying "3" are quite similar: 93.8% for DAE-CNN, 92.8% for CDAE-CNN, 94.4% for DVAE-CNN, 95.5% for DAE-CDAE-CNN and 95.9% for DVAE-CDAE-CNN. These models performed best for classifying numeral "0": the classification precisions for the DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN models were 98.1%, 97.9%, 98.2%, 98.7% and 98.8% respectively. Moreover, in most of the cases the occurrences of other numerals misclassified as "0" were few. For all models, the errors occurred mostly for the numerals "8", "9", "5" and "3", in descending order. As the test set images this time were corrupted with 50% noise, the shapes of the handwritten numerals were almost totally distorted, resulting in less accurate classification performance than in the case of 20% noisy images; when the images were adulterated with 50% noise it was quite difficult to recognize them even with the naked eye.


Table 2

Sample Handwritten Numeral Images along with their Original and Predicted Class Labels

Sample image   Actual label   DAE-CNN   CDAE-CNN   DVAE-CNN   DAE-CDAE-CNN   DVAE-CDAE-CNN
(image)        2              2         2          2          8              8
(image)        4              6         4          4          8              8
(image)        3              3         5          3          3              3
(image)        7              2         2          2          9              9

Table 2 demonstrates some handwritten numeral images and their corresponding class labels in the original as well as in the reconstructed form. It is clearly seen that the first image was classified correctly as "2" when reconstructed with DAE, CDAE and DVAE, but the reconstruction using DAE-CDAE as well as DVAE-CDAE distorted the pattern, thereby causing the CNN to classify it as "8". Numeral "4" was classified correctly when reconstructed using CDAE and DVAE, but misclassified as "6" in the case of DAE-CNN and as "8" by DAE-CDAE-CNN as well as DVAE-CDAE-CNN. However, the third pattern was classified correctly using all the models except CDAE-CNN, which misclassified it as "5". On the other hand, the fourth pattern from the table was misclassified by all of the networks. It is important to note that all of these patterns are quite difficult to identify even by humans because of the diverse writing styles of different persons, and adding noise to these ambiguous patterns makes their classification even more difficult.

Table 3 compares the results of the proposed techniques with other prominent works; it also displays the specific feature(s) of the individual procedures. It is striking that the proposed models did not use any feature extraction procedure, while the vast majority of the existing techniques use one or more feature extraction techniques. Without utilizing any extra technique for feature extraction, the proposed DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN models beat the existing strategies. According to the



table, the test set classification accuracies when the images are corrupted with 50% noise are 95.01%, 94.22%, 95.63%, 96.34% and 96.74%, and with 20% noisy test set images the accuracies are 98.01%, 98.69%, 98.84%, 97.40% and 97.43% for the DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN architectures individually. It is clearly visible from the table that DAE-CNN, CDAE-CNN and DVAE-CNN outperform the other two models along with the existing models in the case of classifying images subject to regular noise. But when it comes to classifying massively noisy images, the DAE-CDAE-CNN and DVAE-CDAE-CNN play the leading role.

Classification Results on CIFAR-10 Dataset

In this section, the proposed methods were tested over the CIFAR-10 dataset. From the grey-scaled CIFAR-10 dataset, 50000 sample images were considered for training purposes and 10000 for testing. Sample images were uniformly distributed over the elementary 10 classes. Noiseless initial data were adulterated with 20% Gaussian noise for the training of the autoencoders, whereas 50% noise was injected for testing purposes only, with the intention of checking the performances of the proposed models in the noisy environment (a sketch of this corruption is given below). Fig. 13 displays some reconstructed images after the noise removal step using DAE, CDAE, DVAE, DAE-CDAE and DVAE-CDAE respectively, in the case where the images were corrupted with 50% noise. Without any doubt, the DVAE-CDAE provides the best reconstruction in the case of 50% noisy images, and the reconstructed images displayed in the figure give evidence of this statement.
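The Gaussian corruption used here can be sketched as below (ours); reading the stated noise proportion as the standard deviation of the noise relative to the [0, 1] intensity range is our assumption, as the paper does not define it.

import numpy as np

def add_gaussian_noise(images, proportion, rng=None):
    # Add zero-mean Gaussian noise, e.g. proportion=0.2 for training the AEs
    # and proportion=0.5 for the test-time corruption.
    rng = np.random.default_rng() if rng is None else rng
    noisy = images + rng.normal(0.0, proportion, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)        # keep intensities in the valid range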

Table 3

A Comparative Description of the Proposed DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN with Some Contemporary Methods

Work reference                         Classification                     Noise   Recog. acc.
Bengio et al. (2007)                   DBN                                0%      98.50%
                                       Deep net                           0%      98.4%
                                       Shallow net                        0%      95%
Glorot et al. (2011)                   Sparse rectifier neural network    25%     98.43%
Vincent et al. (2008)                  SDAE-3                             10%     97.20%
Vincent et al. (2010)                  SVM                                25%     98.37%
                                       SDAE-3                             25%     98.5%
Proposed DAE-CNN                       CNN                                20%     98.01%
                                                                          50%     95.01%
(continued)


Work reference                         Classification                     Noise   Recog. acc.
Proposed CDAE-CNN                      CNN                                20%     98.69%
                                                                          50%     94.22%
Proposed DVAE-CNN                      CNN                                20%     98.84%
                                                                          50%     95.63%
Proposed DAE-CDAE-CNN                  CNN                                20%     97.40%
                                                                          50%     96.34%
Proposed DVAE-CDAE-CNN                 CNN                                20%     97.43%
                                                                          50%     96.74%

Figure 13. Samples of original images from the CIFAR-10 dataset with and without 50% noise, and their reconstructions using the different AEs (DAE, CDAE, DVAE, DAE-CDAE and DVAE-CDAE).

Figure 14 shows the noisy-image classification accuracy of the different proposed models, along with a simple CNN, in the cases where the images were corrupted with 20% noise as well as 50% noise. The reported values were captured up to 400 iterations. Figure 14(a) depicts the fact that DVAE-CNN performs the best with 20% noisy images, achieving an accuracy of 62.8%. The classification performances of DAE-CNN, CDAE-CNN, DAE-CDAE-CNN, DVAE-CDAE-CNN and the simple CNN are 62.37%, 62.69%, 61.88%, 61.93% and 62.04% respectively. Fig. 14(b) shows the classification accuracy of the proposed models in the case of classifying 50% noisy image data up to 400 epochs. This time, the DVAE-CDAE-CNN architecture achieved first place with 53.91% accuracy. The DAE-CNN, CDAE-CNN, DVAE-CNN and DAE-CDAE-CNN showed reasonable classification accuracies (52.95%, 52.5%, 52.64% and 53.68% respectively). It is clearly visible that the classification accuracy of these models degraded while working with the CIFAR-10 dataset compared to their performance over the MNIST dataset. The main reason behind this issue is that the tiny pictures in the CIFAR-10 dataset (32x32 sized) do not give a clear representation of the objects within such a small region. Moreover, the objects were captured in images with different orientations.


Table 4 details the classification accuracy for each individual object for the test set images after 400 epochs with 20% as well as 50% noise. All the models showed their best classification accuracy for the object "Frog": the DAE-CNN, CDAE-CNN and DVAE-CNN recognized it correctly 704, 702 and 707 times respectively, and both the DAE-CDAE-CNN and the DVAE-CDAE-CNN accurately classified it in 699 cases. The worst classification happened while classifying the object "Deer"; even the DVAE-CNN architecture, which showed the best performance while classifying 20% noisy images, misclassified it in 49% of cases. In the case of classifying 50% noisy images, all the models performed worst for the object "Deer". The CDAE-CNN architecture misclassified it 626 times, which was the highest misclassification count; the DAE-CNN architecture classified it correctly only 7 times more than the CDAE-CNN architecture. The DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN architectures showed 42.9%, 37.6% and 39.4% accuracy respectively. The classification results were also not up to the mark for the object "Dog": the DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN models' accuracies in classifying "Dog" were 39.8%, 39.4%, 39.5%, 40.9% and 41.1% respectively. These models performed best for the object "Frog", with 63%, 62.8%, 62.8%, 63.5% and 63.8% accuracies achieved by DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN respectively.

Table 5 demonstrates some sample images from the CIFAR-10 dataset and their corresponding class labels in the original as well as in the reconstructed form. The first image was of an "Airplane"; all the models misclassified it as "Bird". All the models except the DAE-CNN classified the second image correctly as "Horse", whereas DAE-CNN classified it as "Dog". The third image, of a "Ship", was classified accurately by all the models. The last image was of a "Truck"; only DAE-CDAE-CNN and DVAE-CDAE-CNN classified it correctly.

(a) 20% noise (b) 50% noise

Figure 14. Test set recognition accuracy over the CIFAR-10 dataset with batch size 50 and learning rate 1.0 for different networks.

Table 4

Classification Performance of the Hybrid Models in the Case of Individual Objects from the CIFAR-10 Dataset (accurate classification out of 1000 test samples of each class)

Table 5

Sample Objects from the CIFAR-10 Dataset along with their Original and Predicted Class Labels

Original image   Actual label   DAE-CNN      CDAE-CNN     DVAE-CNN     DAE-CDAE-CNN   DVAE-CDAE-CNN
(image)          Airplane       Bird         Bird         Bird         Bird           Bird
(image)          Horse          Dog          Horse        Horse        Horse          Horse
(image)          Ship           Ship         Ship         Ship         Ship           Ship
(image)          Truck          Automobile   Automobile   Automobile   Truck          Truck

Table 6 compares the results of the proposed hybrid noisy-image classifiers with other prominent works while working over the CIFAR-10 dataset, along with the particular feature(s) of those models. As per the table, test set accuracies with 50% noise were 52.95%, 52.5%, 52.64%, 53.68% and 53.91%, while with the 20% noise test set the accuracies were 62.37%, 62.69%, 62.8%, 61.88% and 61.93% for the DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN models respectively.



From this table, it is clearly visible that our proposed models outperform some of the existing prominent models in the case of classifying noisy images, especially when the images were subject to massive noise. Moreover, the classifier in every case was the CNN; all the autoencoders and their hybrids served only for the image denoising task. From the table, it is clearly observable that without prior image denoising by the autoencoders, the performance of the classifier would be disastrous. It is also notable that our models do not need to be trained with images corrupted with noise of different proportions. The DAE, CDAE and DVAE were trained with 20% noisy images only and the CNN was trained with noise-free raw images. Still, the DAE-CDAE-CNN and DVAE-CDAE-CNN models classified 50% noisy images with very good classification accuracy, omitting the necessity for the noisy-image classifiers to be trained with 50% noisy images. The cascading structures of DAE-CDAE-CNN and DVAE-CDAE-CNN enabled them to show such strong performance over the massively noisy data.

Table 6

A Comparative Description of the Proposed DAE-CNN, CDAE-CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN with some Contemporary Methods while Experimenting over CIFAR-10.

Work reference Classification Noise Recog. acc.

Glorot et al. (2011)                   Sparse rectifier neural network    25%     50.48%
Traditional CNN (LeCun et al., 1998)   CNN                                20%     62.04%
                                                                          50%     41.42%
Proposed DAE-CNN                       CNN                                20%     62.37%
                                                                          50%     52.95%
Proposed CDAE-CNN                      CNN                                20%     62.69%
                                                                          50%     52.5%
Proposed DVAE-CNN                      CNN                                20%     62.8%
                                                                          50%     52.64%
Proposed DAE-CDAE-CNN                  CNN                                20%     61.88%
                                                                          50%     53.68%
Proposed DVAE-CDAE-CNN                 CNN                                20%     61.93%
                                                                          50%     53.91%


Significance of the Proposed Hybrid Models

There are several significant differences between the proposed hybrid methods and the existing ones in terms of noisy-image classification. Conventional models train AEs for denoising in a stacked or standalone way, whereas in the proposed hybrid models the AEs are trained independently and then cascaded for denoising. The proposed methods use a CNN as the classifier rather than an MLP or other classifiers, as the CNN performs well for image classification. The experimental results on the benchmark datasets revealed the effectiveness of the proposed hybrid models for both regular and massive noise.

Deep learning-based models depend on their training data; therefore, existing models perform well only when they work with images corrupted with the very same proportion of noise as in the training data, and performance degrades when the noise level increases. Our proposed hybrid models DAE-CDAE-CNN and DVAE-CDAE-CNN have overcome this problem. Both architectures are very good at classifying images injected with massive noise even though they are trained with images corrupted with regular noise. The underlying cascaded structures of these two models make it possible for them to perform well in this case. These two models use two AEs as image denoisers, and both AEs are trained to reconstruct native images from images subject to the same level of regular noise. So, in both cases, the frontier AE removes a proportion of the noise from the input image and the reconstructed image is passed to the following AE for further filtering. As a result, whenever the percentage of noise in the input images is massive, these two noisy-image classifiers perform better than the other models.

On the other hand, the proposed DAE-CNN, CDAE-CNN and DVAE-CNN models performed well for regular noise. As the single AEs in these three models are trained with images at a regular noise level, their standalone structure is sufficient to denoise regular noisy images, which are then easy to classify with the CNN.

CONCLUSION

Conventional image classifiers perform really well with preprocessed data generated in the laboratory. But when they are employed to classify real-world data, most often the images have been corrupted with noise during acquisition and transmission. As a result, there is a high chance that they will fail drastically when applied to real-life tasks. The solution to this problem is to denoise the images prior to feeding them to the classifier. This research work proposed five supervised deep architectures named DAE-CNN, CDAE-


CNN, DVAE-CNN, DAE-CDAE-CNN and DVAE-CDAE-CNN, among which the first three perform well when the images are subjected to a small amount of noise, whereas the last two are for classifying massively noisy data. These models utilize the ideas of various autoencoders and, along with a CNN, construct classifiers for noisy image data. These deep models have the ability to filter noise from the image data and to classify the images by learning latent feature representations from them. The models' classification accuracy over the MNIST and CIFAR-10 datasets (corrupted with noise of different proportions) gives evidence that they have the capability to learn hierarchical representations of the images. Still, there is scope for further development in the future. Each of the hybrid models proposed here suits a different level of noise. Our future research work will focus on building a standalone model using these techniques that would be able to classify images adulterated with any proportion of noise.

REFERENCES

Agostinelli, F., Anderson, M. R., & Lee, H. (2013). Adaptive multi-column deep neural networks with application to robust image denoising. Proceedings of the Advances in Neural Information Processing Systems, 1493-1501.

Akhand, M. A. H., Ahmed, M., & Rahman, M. H. (2016). Convolutional neural network based handwritten Bengali and Bengali-English mixed numeral recognition. International Journal of Image, Graphics and Signal Processing, 8(9), 40-50. doi: 10.5815/ijigsp.2016.09.06

Akhand, M. A. H., Ahmed, M., Rahman, M. H. & Islam, M. M. (2017). Convolutional neural network training incorporating rotation based generated patterns and handwritten numeral recognition of major Indian scripts. IETE Journal of Research (TIJR), Taylor & Francis, 63(Online), 19 pages. doi: 10.1080/03772063.2017.1351322

Arigbabu, O. A., Ahmad, S. M. S., Adnan, W. A. W., Yussof, S., & Mahmood, S. (2017). Soft biometrics: Gender recognition from unconstrained face images using local feature descriptor. arXiv Preprint arXiv:1702.02537.

Bar, Y., Diamant, I., Wolf, L., & Greenspan, H. (2015, March). Deep learning with non-medical training used for chest pathology identification. Proceedings of SPIE, the International Society for Optics and Photonics, 94140V. doi: 10.1117/12.2083124.


Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. Proceedings of the Advances in Neural Information Processing Systems, 153-160.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1-127.

Bezdek, J. C., Hall, L. O., & Clarke, L. (1993). Review of MR image segmentation techniques using pattern recognition. Medical Physics, 20(4), 1033-1048.

Bosch, A., Zisserman, A., & Munoz, X. (2007, July). Representing shape with a spatial pyramid kernel. Proceedings of the 6th ACM International Conference on Image and Video Retrieval, 401-408.

Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4), 291-294.

Bouvrie, J. (2006). Notes on convolutional neural networks, Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, MIT, Cambridge.

Burger, H. C., Schuler, C. J., & Harmeling, S. (2012, June). Image denoising: Can plain neural networks compete with BM3D? Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2392-2399.

Cho, K. (2013). Boltzmann machines and denoising autoencoders for image denoising. arXiv Preprint arXiv:1301.3468.

Cireşan, D. C., Meier, U., Masci, J., Gambardella, L. M., & Schmidhuber, J. (2011). High-performance neural networks for visual object classification. arXiv Preprint arXiv:1102.0183.

Cireşan, D., Meier, U., Masci, J., & Schmidhuber, J. (2011, July). A committee of neural networks for traffic sign classification. Proceedings of the 2011 International Joint Conference on Neural Networks (IJCNN), 1918-1921.

Cireşan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. (2011, September). Convolutional neural network committees for handwritten character classification. Proceedings of the 2011 International Conference on Document Analysis and Recognition (ICDAR), 1135-1139.

Coates, A., Ng, A., & Lee, H. (2011, June). An analysis of single-layer networks in unsupervised feature learning. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 215-223.

Coifman, R. R., & Donoho, D. L. (1995). Translation-invariant de-noising. Wavelets and Statistics, 103, 125-150.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

Du, B., Xiong, W., Wu, J., Zhang, L., Zhang, L., & Tao, D. (2017). Stacked convolutional denoising auto-encoders for feature representation. IEEE Transactions on Cybernetics, 47(4), 1017-1027.

Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), 3736-3745.

Glorot, X., Bordes, A., & Bengio, Y. (2011, June). Deep sparse rectifier neural networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315-323.

Gondara, L. (2016, December). Medical image denoising using convolutional denoising autoencoders. Proceedings of the 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 241-246.

Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.

Huang, F. J., & LeCun, Y. (2006). Large-scale learning with SVM and convolutional nets for generic object recognition. Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1-8.

Im, D. J., Ahn, S., Memisevic, R., & Bengio, Y. (2017). Denoising criterion for variational auto-encoding framework. Association for the Advancement of Artificial Intelligence, 2059-2065.

Jain, V., & Seung, S. (2009). Natural image denoising with convolutional networks. Proceedings of the Advances in Neural Information Processing Systems, 769-776.

Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv Preprint arXiv:1312.6114.

Kingma, D. P., & Welling, M. (2014). Stochastic gradient VB and the variational auto-encoder. Proceedings of the Second International Conference on Learning Representations (ICLR), 1-14.

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Proceedings of the Advances in Neural Information Processing Systems, 1097-1105.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. doi: 10.1109/5.726791

LeCun, Y., Cortes, C., & Burges, C. J. (2010). MNIST handwritten digit database. AT&T Labs [Online]. Retrieved from http://yann.lecun.com/exdb/mnist

Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th Annual International Conference on Machine Learning, 609-616.

Liu, T., Fang, S., Zhao, Y., Wang, P., & Zhang, J. (2015). Implementation of training convolutional neural networks. arXiv Preprint arXiv:1506.01195.

Lu, D., & Weng, Q. (2007). A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28(5), 823-870.

Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2009). Online dictionary learning for sparse coding. Proceedings of the 26th Annual International Conference on Machine Learning, 689-696.

Masci, J., Meier, U., Cireşan, D., & Schmidhuber, J. (2011). Stacked convolutional auto-encoders for hierarchical feature extraction. Artificial Neural Networks and Machine Learning (ICANN), 52-59.

Matsugu, M., Mori, K., Mitari, Y., & Kaneda, Y. (2003). Subject independent facial expression recognition with robust face detection using a convolutional neural network. Neural Networks, 16(5), 555-559.

Norouzi, M., Ranjbar, M., & Mori, G. (2009, June). Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2735-2742.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311-3325.

Perona, P., & Malik, J. (1990). Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), 629-639.

Rudin, L. I., & Osher, S. (1994). Total variation-based image restoration with free local constraints. Proceedings of the IEEE International Conference on Image Processing, 1, 31-35.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, 318-362.

Sanches, J. M., Nascimento, J. C., & Marques, J. S. (2008). Medical image noise reduction using the Sylvester–Lyapunov equation. IEEE Transactions on Image Processing, 17(9), 1522-1539.

Scherer, D., Müller, A., & Behnke, S. (2010). Evaluation of pooling operations in convolutional architectures for object recognition. Proceedings of the International Conference on Artificial Neural Networks (ICANN), 92-101.

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.

Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 806-813.

Shin, H. C., Orton, M. R., Collins, D. J., Doran, S. J., & Leach, M. O. (2013). Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D-patient data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1930-1943.

Simonyan, K., Vedaldi, A., & Zisserman, A. (2013). Deep Fisher networks for large-scale image classification. Advances in Neural Information Processing Systems, 163-171.

Singh, A. K., Shukla, V. P., Biradar, S. R., & Tiwari, S. (2014). Multiclass noisy image classification based on optimal threshold and neighboring window denoising. International Journal of Computer Engineering Science (IJCES), 4(3), 1-11.

Subakan, O., Jian, B., Vemuri, B. C., & Vallejos, C. E. (2007). Feature-preserving image smoothing using a continuous mixture of tensors. Proceedings of the 11th International Conference on Computer Vision (ICCV), 1-6.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv Preprint arXiv:1312.6199.

Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., et al. (2010). Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2), 511-538.

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning, 1096-1103.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec), 3371-3408.

Xie, J., Xu, L., & Chen, E. (2012). Image denoising and inpainting with deep neural networks. Proceedings of the Advances in Neural Information Processing Systems, 341-349.

Xu, L., Ren, J. S., Liu, C., & Jia, J. (2014). Deep convolutional neural network for image deconvolution. Proceedings of the Advances in Neural Information Processing Systems, 1790-1798.

Zeiler, M. D., Krishnan, D., Taylor, G. W., & Fergus, R. (2010). Deconvolutional networks. Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2528-2535.

Zeiler, M. D., Taylor, G. W., & Fergus, R. (2011). Adaptive deconvolutional networks for mid- and high-level feature learning. Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), 2018-2025.
