Deep Classifier Structures with Autoencoder for Higher-Level Feature Extraction

Maysa I. A. Almulla Khalaf (1,2) and John Q. Gan (1)
(1) School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, CO4 3SQ, Colchester, Essex, U.K.
(2) Department of Computer Science, Baghdad University, Baghdad, Iraq

Keywords: Stacked Autoencoder, Deep Learning, Feature Learning, Effective Weight Initialisation.

Abstract: This paper investigates deep classifier structures with stacked autoencoder (SAE) for higher-level feature extraction, aiming to overcome difficulties in training deep neural networks with limited training data in high-dimensional feature space, such as overfitting and vanishing/exploding gradients. A three-stage learning algorithm is proposed in this paper for training a deep multilayer perceptron (DMLP) as the classifier. At the first stage, unsupervised learning is adopted using SAE to obtain the initial weights of the feature extraction layers of the DMLP. At the second stage, error back-propagation is used to train the DMLP by fixing the weights obtained at the first stage for its feature extraction layers. At the third stage, all the weights of the DMLP obtained at the second stage are refined by error back-propagation. Cross-validation is adopted to determine the network structures and the values of the learning parameters, and test datasets unseen in the cross-validation are used to evaluate the performance of the DMLP trained using the three-stage learning algorithm, in comparison with support vector machines (SVM) combined with SAE. Experimental results have demonstrated the advantages and effectiveness of the proposed method.

1 INTRODUCTION

In recent years, deep learning for feature extraction has attracted much attention in different areas such as speech recognition, computer vision, fraud detection, social media analysis, and medical informatics (LeCun et al., 2015; Hinton and Salakhutdinov, 2006; Najafabadi et al., 2015; Chen and Lin, 2014; Hinton et al., 2012; Krizhevsky et al., 2012; Ravì et al., 2017). One of the main advantages of deep learning, due to the use of deep neural network structures, is that it can learn feature representations without a separate feature extraction process, which is otherwise a very significant processing step in pattern recognition (Bengio et al., 2013; Bengio, 2013).
Unsupervised learning is usually required for feature learning, for example using the restricted Boltzmann machine (RBM) (Salakhutdinov and Hinton, 2009), sparse autoencoder (Lee, 2010; Abdulhussain and Gan, 2015), stacked autoencoder (SAE) (Gehring et al., 2013; Zhou et al., 2015), denoising autoencoder (Vincent et al., 2008; Vincent et al., 2010), and contractive autoencoder (Rifai et al., 2011).

For classification tasks, supervised learning is more desirable, using support vector machines (Vapnik, 2013) or feedforward neural networks as classifiers. How to effectively combine supervised learning with unsupervised learning is a critical issue for the success of deep learning for pattern classification (Glorot and Bengio, 2010).

Other major issues in deep learning include the overfitting problem and vanishing/exploding gradients during error back-propagation, which arise from adopting deep neural network structures such as the deep multilayer perceptron (DMLP) (Glorot and Bengio, 2010; Geman et al., 1992).

Many techniques have been proposed to solve the problems in training deep neural networks. Hinton et al. (2006) introduced the idea of greedy layer-wise pre-training. Bengio et al. (2007) proposed to train the layers of a deep neural network in a sequence using an auxiliary objective and then "fine-tune" the entire network with standard optimization methods such as stochastic gradient descent. Martens (2010) showed that the truncated-Newton method has the ability to train deep neural networks from certain random initialisations without pre-training; however, it is still inadequate to resolve the training challenges. It is known that most deep learning models cannot be trained well with random initialisation (Martens, 2010; Mohamed et al., 2012; Glorot and Bengio, 2010b; Chapelle and Erhan, 2011). Effective weight initialisation or pre-training has been widely explored for avoiding vanishing/exploding gradients (Yam and Chow, 2000; Sutskever et al., 2013; Fernandez-Redondo and Hernandez-Espinosa, 2001; DeSousa, 2016; Sodhi et al., 2014). Using a huge amount of training data can overcome overfitting to some extent (Geman et al., 1992). However, in many applications there is no large amount of training data available, or the computer power is insufficient to handle a huge amount of training data, and thus regularisation techniques such as sparse structures and the dropout technique are widely used for combatting overfitting (Zhang et al., 2015; Shu and Fyshe, 2013; Srivastava et al., 2014).

This paper investigates deep classifier structures with stacked autoencoder, aiming to overcome difficulties in training deep neural networks with limited training data in high-dimensional feature space. Experiments were conducted on three datasets, with the performance of the proposed method evaluated by comparison with existing methods. This paper is organized as follows: Section 2 describes the basic principles of the stacked sparse autoencoder, the deep multilayer perceptron, and the proposed approach. Section 3 presents the experimental results and discussion. The conclusion is drawn in Section 4.

2 STACKED SPARSE AUTOENCODER, DEEP MULTILAYER PERCEPTRON, AND THE PROPOSED APPROACH

2.1 Stacked Sparse Autoencoder

An autoencoder is an unsupervised neural network trained using stochastic gradient descent algorithms, which learns a non-linear approximation of an identity function (Abdulhussain and Gan, 2016; Zhou et al., 2015; Zhang et al., 2015; Shu and Fyshe, 2013). Figure 1 illustrates a non-linear multilayer autoencoder network, which can be implemented by stacking two autoencoders, each with one hidden layer.

[Figure 1: Multilayer autoencoder.]

A stacked autoencoder may have three or more hidden layers, but for simplicity an autoencoder with just a single hidden layer is described in detail as follows. The connection weights and bias parameters can be denoted as w = [vectorised W1; vectorised W2; b1; b2], where W1 ∈ R^(K×N) is the encoding weight matrix, W2 ∈ R^(N×K) is the decoding weight matrix, b1 ∈ R^K is the encoding bias vector, and b2 ∈ R^N is the decoding bias vector.

For a training dataset, let the output matrix of the autoencoder be O = [o^1, o^2, ..., o^m], which is supposed to be the reconstruction of the input matrix X = [x^1, x^2, ..., x^m], where o^i ∈ R^N and x^i ∈ R^N are the output vector and input vector of the autoencoder respectively, and m is the number of samples. Correspondingly, let the hidden output matrix be H = [h^1, h^2, ..., h^m], where h^i ∈ R^K is the hidden output vector of the autoencoder, to be used as the feature vector in feature learning tasks.

For the ith sample, the hidden output vector is defined as

    h^i = g(W1 x^i + b1)    (1)

and the output is defined by

    o^i = g(W2 h^i + b2)    (2)

where g(x) is the sigmoid logistic function 1/(1 + exp(−x)).
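To make the matrix notation concrete, the following is a minimal NumPy sketch of the forward pass in Equations (1) and (2). It is not the authors' implementation; the dimensions N and K, the number of samples m, and the random initialisation are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        # g(x) = 1 / (1 + exp(-x)), applied element-wise
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative sizes: N input units, K hidden units, m samples (assumptions)
    N, K, m = 100, 20, 500
    rng = np.random.default_rng(0)

    X = rng.random((N, m))            # input matrix, columns are samples x^i
    W1 = rng.normal(0, 0.01, (K, N))  # encoding weight matrix, W1 in R^(K x N)
    W2 = rng.normal(0, 0.01, (N, K))  # decoding weight matrix, W2 in R^(N x K)
    b1 = np.zeros((K, 1))             # encoding bias vector
    b2 = np.zeros((N, 1))             # decoding bias vector

    H = sigmoid(W1 @ X + b1)          # hidden outputs, Eq. (1), H in R^(K x m)
    O = sigmoid(W2 @ H + b2)          # reconstructions, Eq. (2), O in R^(N x m)

    reconstruction_error = 0.5 / m * np.sum((X - O) ** 2)

The value of reconstruction_error corresponds to the first (reconstruction) term of the sparse learning objective introduced next.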
For training an autoencoder with sparse representation, the learning objective function is defined as follows:

    J_sparse(w) = (1/(2m)) Σ_{i=1..m} ||x^i − o^i||² + (λ/2) ||W||² + β Σ_{j=1..K} KL(p || p̂_j)    (3)

where p is the sparsity parameter and p̂_j is the average output of the jth hidden node, averaged over all the samples, i.e.,

    p̂_j = (1/m) Σ_{i=1..m} h_j^i    (4)

λ is the coefficient for L2 regularisation (weight decay), and β is the coefficient controlling the sparsity penalty, which is defined by the Kullback-Leibler divergence:

    KL(p || p̂_j) = p log(p / p̂_j) + (1 − p) log((1 − p) / (1 − p̂_j))    (5)

The learning rule for updating the weight vector w (containing W1, W2, b1, and b2) is error back-propagation based on gradient descent, i.e., Δw = −η · w_grad. The error gradients with respect to W1, W2, b1, and b2 are derived as follows respectively (Abdulhussain and Gan, 2016; Zhang et al., 2015):

    W1_grad = ((W2^T (O − X) + β(−p/p̂_j + (1 − p)/(1 − p̂_j)) I) .* g′(H)) X^T / m + λ W1    (6)

    b1_grad = ((W2^T (O − X) + β(−p/p̂_j + (1 − p)/(1 − p̂_j)) I) .* g′(H)) I / m    (7)

    W2_grad = ((O − X) H^T) / m + λ W2    (8)

    b2_grad = (O − X) I / m    (9)

where g′(x) = g(x) .* [1 − g(x)] is the derivative of the sigmoid function.

2.3 Proposed Approach

Training deep neural networks usually needs a huge amount of training data, especially in a high-dimensional input space; otherwise, overfitting would be a serious problem due to the high complexity of the neural network model. However, in many applications the required huge amount of training data may be unavailable, or the computer power available is insufficient to handle a huge amount of training data. With deep neural network training, there may also be local minimum and vanishing/exploding gradient problems without proper weight initialisation. Deep classifier structures with stacked autoencoder are investigated in this paper to overcome these problems, whose training process consists of the following three stages (a code sketch follows the list):

1) At the first stage, unsupervised learning is adopted to train a stacked autoencoder with random initial weights to obtain the initial weights of the feature extraction layers of the DMLP. The autoencoder consists of N input units, an encoder with two hidden layers of K1 and K2 neurons respectively, a symmetric decoder, and N output units. Figure 3 illustrates its structure.

2) At the second stage, error back-propagation is employed to pre-train the DMLP by fixing the weights obtained at the first stage for its feature extraction layers (W1 and W2).

3) At the third stage, all the weights of the DMLP obtained at the second stage are refined by error back-propagation.
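The paper does not provide code, so the following is a minimal sketch of the three-stage learning algorithm using PyTorch, a framework chosen here purely for illustration. The layer sizes, learning rate, epoch counts, and the data loaders unlabeled_loader and labeled_loader are hypothetical, and the sparsity and weight-decay terms of Equation (3) are omitted from the stage-1 loss for brevity.

    import torch
    import torch.nn as nn

    # Illustrative sizes (assumptions, not taken from the paper's experiments)
    N, K1, K2, n_classes = 100, 64, 32, 2

    # Feature-extraction layers of the DMLP, initialised by training the SAE encoder
    encoder = nn.Sequential(
        nn.Linear(N, K1), nn.Sigmoid(),
        nn.Linear(K1, K2), nn.Sigmoid(),
    )
    classifier = nn.Linear(K2, n_classes)   # output layer of the DMLP
    dmlp = nn.Sequential(encoder, classifier)

    def train(model, params, loader, loss_fn, epochs, lr=0.1):
        # Plain stochastic gradient descent with error back-propagation
        opt = torch.optim.SGD(params, lr=lr)
        for _ in range(epochs):
            for inputs, targets in loader:
                opt.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                opt.step()

    # Stage 1: unsupervised training of the stacked autoencoder
    # (encoder plus symmetric decoder) from random initial weights.
    decoder = nn.Sequential(
        nn.Linear(K2, K1), nn.Sigmoid(),
        nn.Linear(K1, N), nn.Sigmoid(),
    )
    sae = nn.Sequential(encoder, decoder)
    # train(sae, sae.parameters(), unlabeled_loader, nn.MSELoss(), epochs=50)
    # (unlabeled_loader is a hypothetical loader yielding (x, x) pairs)

    # Stage 2: pre-train the DMLP by back-propagation with the
    # feature-extraction weights (W1 and W2) kept fixed.
    for p in encoder.parameters():
        p.requires_grad_(False)
    # train(dmlp, classifier.parameters(), labeled_loader, nn.CrossEntropyLoss(), epochs=50)

    # Stage 3: unfreeze everything and refine all DMLP weights by back-propagation.
    for p in encoder.parameters():
        p.requires_grad_(True)
    # train(dmlp, dmlp.parameters(), labeled_loader, nn.CrossEntropyLoss(), epochs=50)

In the paper, the stacked autoencoder at stage 1 is trained with the sparse objective of Equation (3), and cross-validation is used to determine the network structure (K1, K2) and the learning parameters.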