Deep Classifier Structures with Autoencoder for Higher-Level Feature Extraction

Maysa I. A. Almulla Khalaf (1,2) and John Q. Gan (1)
(1) School of Computer Science and Electronic Engineering, University of Essex, Wivenhoe Park, CO4 3SQ, Colchester, Essex, U.K.
(2) Department of Computer Science, Baghdad University, Baghdad, Iraq

Keywords: Stacked Autoencoder, Deep Learning, Feature Learning, Effective Weight Initialisation.

Abstract: This paper investigates deep classifier structures with stacked autoencoder (SAE) for higher-level feature extraction, aiming to overcome difficulties in training deep neural networks with limited training data in high-dimensional feature space, such as overfitting and vanishing/exploding gradients. A three-stage learning algorithm is proposed in this paper for training a deep multilayer perceptron (DMLP) as the classifier. At the first stage, unsupervised learning is adopted using SAE to obtain the initial weights of the feature extraction layers of the DMLP. At the second stage, error back-propagation is used to train the DMLP by fixing the weights obtained at the first stage for its feature extraction layers. At the third stage, all the weights of the DMLP obtained at the second stage are refined by error back-propagation. Cross-validation is adopted to determine the network structures and the values of the learning parameters, and test datasets unseen in the cross-validation are used to evaluate the performance of the DMLP trained using the three-stage learning algorithm, in comparison with support vector machines (SVM) combined with SAE. Experimental results have demonstrated the advantages and effectiveness of the proposed method.

1 INTRODUCTION

In recent years, deep learning for feature extraction has attracted much attention in different areas such as speech recognition, computer vision, fraud detection, social media analysis, and medical informatics (LeCun et al., 2015; Hinton and Salakhutdinov, 2006; Najafabadi et al., 2015; Chen and Lin, 2014; Hinton et al., 2012; Krizhevsky et al., 2012; Ravì et al., 2017). One of the main advantages of deep learning, due to the use of deep neural network structures, is that it can learn feature representations without a separate feature extraction process, which is otherwise a very significant processing step in pattern recognition (Bengio et al., 2013; Bengio, 2013).
Unsupervised learning is usually required for feature learning, for example using the restricted Boltzmann machine (RBM) (Salakhutdinov and Hinton, 2009), sparse autoencoder (Lee, 2010; Abdulhussain and Gan, 2015), stacked autoencoder (SAE) (Gehring et al., 2013; Zhou et al., 2015), denoising autoencoder (Vincent et al., 2008; Vincent et al., 2010), and contractive autoencoder (Rifai et al., 2011).

For classification tasks, supervised learning is more desirable, using support vector machines (Vapnik, 2013) or feedforward neural networks as classifiers. How to effectively combine supervised learning with unsupervised learning is a critical issue for the success of deep learning for pattern classification (Glorot and Bengio, 2010).

Other major issues in deep learning include the overfitting problem and vanishing/exploding gradients during error back-propagation, which arise from adopting deep neural network structures such as the deep multilayer perceptron (DMLP) (Glorot and Bengio, 2010; Geman et al., 1992).

Many techniques have been proposed to solve the problems in training deep neural networks. Hinton et al. (2006) introduced the idea of greedy layer-wise pre-training. Bengio et al. (2007) proposed to train the layers of a deep neural network in a sequence using an auxiliary objective and then "fine-tune" the entire network with standard optimization methods such as stochastic gradient descent. Martens (2010) showed that the truncated-Newton method has the ability to train deep neural networks from certain random initialisations without pre-training; however, it is still inadequate to resolve the training challenges. It is known that most deep learning models cannot be trained well with random initialisation (Martens, 2010; Mohamed et al., 2012; Glorot and Bengio, 2010b; Chapelle and Erhan, 2011). Effective weight initialisation or pre-training has been widely explored for avoiding vanishing/exploding gradients (Yam and Chow, 2000; Sutskever et al., 2013; Fernandez-Redondo and Hernandez-Espinosa, 2001; DeSousa, 2016; Sodhi et al., 2014). Using a huge amount of training data can overcome overfitting to some extent (Geman et al., 1992). However, in many applications there is no large amount of training data available, or the computer power is insufficient to handle a huge amount of training data, and thus regularisation techniques such as sparse structures and the dropout technique are widely used for combatting overfitting (Zhang et al., 2015; Shu and Fyshe, 2013; Srivastava et al., 2014).

This paper investigates deep classifier structures with stacked autoencoder, aiming to overcome difficulties in training deep neural networks with limited training data in high-dimensional feature space. Experiments were conducted on three datasets, with the performance of the proposed method evaluated by comparison with existing methods. This paper is organized as follows: Section 2 describes the basic principles of the stacked sparse autoencoder, the deep multilayer perceptron, and the proposed approach. Section 3 presents the experimental results and discussion. The conclusion is drawn in Section 4.

2 STACKED SPARSE AUTOENCODER, DEEP MULTILAYER PERCEPTRON, AND THE PROPOSED APPROACH

2.1 Stacked Sparse Autoencoder

An autoencoder is an unsupervised neural network trained using stochastic gradient descent algorithms, which learns a non-linear approximation of an identity function (Abdulhussain and Gan, 2016; Zhou et al., 2015; Zhang et al., 2015; Shu and Fyshe, 2013). Figure 1 illustrates a non-linear multilayer autoencoder network, which can be implemented by stacking two autoencoders, each with one hidden layer.

[Figure 1: Multilayer autoencoder.]

A stacked autoencoder may have three or more hidden layers, but for simplicity an autoencoder with just a single hidden layer is described in detail as follows. The connection weights and bias parameters can be denoted as w = [vectorised W1; vectorised W2; b1; b2], where W1 ∈ R^(K×N) is the encoding weight matrix, W2 ∈ R^(N×K) is the decoding weight matrix, b1 ∈ R^K is the encoding bias vector, and b2 ∈ R^N is the decoding bias vector.

For a training dataset, let the output matrix of the autoencoder be O = [o^1, o^2, ..., o^m], which is supposed to be the reconstruction of the input matrix X = [x^1, x^2, ..., x^m], where o^i ∈ R^N and x^i ∈ R^N are the output vector and input vector of the autoencoder respectively, and m is the number of samples. Correspondingly, let the hidden output matrix be H = [h^1, h^2, ..., h^m], where h^i ∈ R^K is the hidden output vector of the autoencoder, to be used as the feature vector in feature learning tasks.

For the ith sample, the hidden output vector is defined as

    h^i = g(W1 x^i + b1)    (1)

and the output is defined by

    o^i = g(W2 h^i + b2)    (2)

where g(x) is the sigmoid logistic function 1/(1 + exp(−x)).
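To make the matrix notation concrete, the following is a minimal NumPy sketch of the forward pass in Equations (1) and (2). It is not the authors' implementation; the dimensions N and K, the number of samples m, and the random initialisation are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        # g(x) = 1 / (1 + exp(-x)), applied element-wise
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative sizes: N input units, K hidden units, m samples (assumptions)
    N, K, m = 100, 20, 500
    rng = np.random.default_rng(0)

    X = rng.random((N, m))            # input matrix, columns are samples x^i
    W1 = rng.normal(0, 0.01, (K, N))  # encoding weight matrix, W1 in R^(K x N)
    W2 = rng.normal(0, 0.01, (N, K))  # decoding weight matrix, W2 in R^(N x K)
    b1 = np.zeros((K, 1))             # encoding bias vector
    b2 = np.zeros((N, 1))             # decoding bias vector

    H = sigmoid(W1 @ X + b1)          # hidden outputs, Eq. (1), H in R^(K x m)
    O = sigmoid(W2 @ H + b2)          # reconstructions, Eq. (2), O in R^(N x m)

    reconstruction_error = 0.5 / m * np.sum((X - O) ** 2)

The value of reconstruction_error corresponds to the first (reconstruction) term of the sparse learning objective introduced next.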
For training an autoencoder with sparse representation, the learning objective function is defined as follows:

    J_sparse(w) = (1/(2m)) Σ_{i=1..m} ||x^i − o^i||² + (λ/2) ||W||² + β Σ_{j=1..K} KL(p || p̂_j)    (3)

where p is the sparsity parameter and p̂_j is the average output of the jth hidden node, averaged over all the samples, i.e.,

    p̂_j = (1/m) Σ_{i=1..m} h_j^i    (4)

λ is the coefficient for L2 regularisation (weight decay), and β is the coefficient controlling the sparsity penalty, which is defined by the Kullback-Leibler divergence:

    KL(p || p̂_j) = p log(p / p̂_j) + (1 − p) log((1 − p) / (1 − p̂_j))    (5)

The learning rule for updating the weight vector w (containing W1, W2, b1, and b2) is error back-propagation based on gradient descent, i.e., Δw = −η · w_grad. The error gradients with respect to W1, W2, b1, and b2 are derived as follows respectively (Abdulhussain and Gan, 2016; Zhang et al., 2015):

    W1_grad = ((W2^T (O − X) + β(−p/p̂_j + (1 − p)/(1 − p̂_j)) I) .* g′(H)) X^T / m + λ W1    (6)

    b1_grad = ((W2^T (O − X) + β(−p/p̂_j + (1 − p)/(1 − p̂_j)) I) .* g′(H)) I / m    (7)

    W2_grad = ((O − X) H^T) / m + λ W2    (8)

    b2_grad = (O − X) I / m    (9)

where g′(x) = g(x) .* [1 − g(x)] is the derivative of the sigmoid function.

2.3 Proposed Approach

Training deep neural networks usually needs a huge amount of training data, especially in a high-dimensional input space; otherwise, overfitting would be a serious problem due to the high complexity of the neural network model. However, in many applications the required huge amount of training data may be unavailable, or the computer power available is insufficient to handle a huge amount of training data. With deep neural network training, there may also be local minimum and vanishing/exploding gradient problems without proper weight initialisation. Deep classifier structures with stacked autoencoder are investigated in this paper to overcome these problems, whose training process consists of the following three stages (a code sketch follows the list):

1) At the first stage, unsupervised learning is adopted to train a stacked autoencoder with random initial weights to obtain the initial weights of the feature extraction layers of the DMLP. The autoencoder consists of N input units, an encoder with two hidden layers of K1 and K2 neurons respectively, a symmetric decoder, and N output units. Figure 3 illustrates its structure.

2) At the second stage, error back-propagation is employed to pre-train the DMLP by fixing the weights obtained at the first stage for its feature extraction layers (W1 and W2).

3) At the third stage, all the weights of the DMLP obtained at the second stage are refined by error back-propagation.
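The paper does not provide code, so the following is a minimal sketch of the three-stage learning algorithm using PyTorch, a framework chosen here purely for illustration. The layer sizes, learning rate, epoch counts, and the data loaders unlabeled_loader and labeled_loader are hypothetical, and the sparsity and weight-decay terms of Equation (3) are omitted from the stage-1 loss for brevity.

    import torch
    import torch.nn as nn

    # Illustrative sizes (assumptions, not taken from the paper's experiments)
    N, K1, K2, n_classes = 100, 64, 32, 2

    # Feature-extraction layers of the DMLP, initialised by training the SAE encoder
    encoder = nn.Sequential(
        nn.Linear(N, K1), nn.Sigmoid(),
        nn.Linear(K1, K2), nn.Sigmoid(),
    )
    classifier = nn.Linear(K2, n_classes)   # output layer of the DMLP
    dmlp = nn.Sequential(encoder, classifier)

    def train(model, params, loader, loss_fn, epochs, lr=0.1):
        # Plain stochastic gradient descent with error back-propagation
        opt = torch.optim.SGD(params, lr=lr)
        for _ in range(epochs):
            for inputs, targets in loader:
                opt.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                opt.step()

    # Stage 1: unsupervised training of the stacked autoencoder
    # (encoder plus symmetric decoder) from random initial weights.
    decoder = nn.Sequential(
        nn.Linear(K2, K1), nn.Sigmoid(),
        nn.Linear(K1, N), nn.Sigmoid(),
    )
    sae = nn.Sequential(encoder, decoder)
    # train(sae, sae.parameters(), unlabeled_loader, nn.MSELoss(), epochs=50)
    # (unlabeled_loader is a hypothetical loader yielding (x, x) pairs)

    # Stage 2: pre-train the DMLP by back-propagation with the
    # feature-extraction weights (W1 and W2) kept fixed.
    for p in encoder.parameters():
        p.requires_grad_(False)
    # train(dmlp, classifier.parameters(), labeled_loader, nn.CrossEntropyLoss(), epochs=50)

    # Stage 3: unfreeze everything and refine all DMLP weights by back-propagation.
    for p in encoder.parameters():
        p.requires_grad_(True)
    # train(dmlp, dmlp.parameters(), labeled_loader, nn.CrossEntropyLoss(), epochs=50)

In the paper, the stacked autoencoder at stage 1 is trained with the sparse objective of Equation (3), and cross-validation is used to determine the network structure (K1, K2) and the learning parameters.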