Multilayer Perceptron Backpropagation Summary

Total Page:16

File Type:pdf, Size:1020Kb

Multilayer Perceptron Backpropagation Summary Multilayer Perceptron Backpropagation Summary C. Chadoulos, P. Kaplanoglou, S. Papadopoulos, Prof. Ioannis Pitas Aristotle University of Thessaloniki [email protected] www.aiia.csd.auth.gr Version 2.8 Biological Neuron ⚫ Basic computational unit of the brain. ⚫ Main parts: • Dendrites ➔ Act as inputs. • Soma ➔ Main body of neuron. • Axon ➔ Acts as output. ⚫ Neurons connect with other neurons via synapses. 2 Artificial Neurons ⚫ Artificial neurons are mathematical models loosely inspired by their biological counterparts. x2 ⚫ Incoming signals through the axons serve as the input vector: x1 x 푇 n 퐱 = 푥1, 푥2, … , 푥푛 , 푥푖 ∈ ℝ. w1 ⚫ The synaptic weights are grouped in a weight vector: w w1 n 푇 Z 퐰 = [푤1, 푤2, … , 푤푛] , 푤푖 ∈ ℝ. ⚫ Synaptic integration is modeled as the inner product: threshold 푁 푇 푧 = ෍ 푤푖 푥푖 = 퐰 퐱. 푖=1 10 Perceptron threshold Axon from a neuron synapse dendrit e Cell body Activation Function 11 Perceptron ⚫ Simplest mathematical model of a neuron (Rosenblatt – McCulloch & Pitts) ⚫ Inputs are binary, 푥푖 ∈ 0,1 . ⚫ Produces single, binary outputs, 푦푖 ∈ {0,1}, through an activation function 푓(∙). ⚫ Output 푦푖 signifies whether the neuron will fire or not. ⚫ Firing threshold: 퐰푇퐱 ≥ −푏 ⇒ 퐰푇퐱 + 푏 ≥ 0. Cell body 12 MLP Architecture ⚫ Multilayer perceptrons (feed-forward neural networks) typically consist of 퐿 + 1 layers with 푘푙 neurons in each layer: 푙 = 0, … , 퐿. ⚫ The first layer (input layer 푙 = 0) contains 푛 inputs, where 푛 is the dimensionality of the input sample vector. ⚫ The 퐿 − 1 hidden layers 푙 = 1, … , 퐿 − 1 can contain any number of neurons. 26 MLP Architecture ⚫ The output layer 푙 = 퐿 typically contains either one neuron (especially when dealing with regression problems) or 푚 neurons, where 푚 is the number of classes. ⚫ Thus, neural networks can be described by the following characteristic numbers: 1.Depth: Number of hidden layers 퐿. 2.Width: Number of neurons per layer 푘푙, 푙 = 1, … , 퐿. 3. 푘퐿 = 1, or 푘퐿 = 푚. 26 MLP Forward Propagation ⚫ Following the same procedure for 푡ℎ 푓(푧 푙 ) the 푙 hidden layer, we get: 푧(푙) 1 (푙) 1 푎1 . 푙 . 푓(푧푖 ) (푙) (푙) 푎 푧푖 푖 . ⚫ In compact form: . 푙 (푙) 푓(푧푗 )1 (푙) 푧푗 푎푗 . 푙 푓(푧푘 ) 푧(푙) 푙 푎(푙) 푘푙 푘푙 . 29 MLP Activation Functions ⚫ Previous theorem verifies MLPs as universal approximators, under the condition that the activation functions 휙 ∙ for each neuron are continuous. ⚫ Continuous (and differentiable) activation functions: 1 푛 ⚫ 휎 푥 = , ℝ → 0,1 1+푒−푥 0, 푥 < 0 푛 ⚫ 푅푒퐿푈 푥 = ቊ , ℝ → ℝ 푥, 푥 ≥ 0 + −1 푛 ⚫ 푓 푥 = tan 푥 , ℝ → −휋Τ2 , 휋Τ2 푥 −푥 푒 −푒 푛 ⚫ 푓 푥 = tanh 푥 = , ℝ → −1, +1 푒푥+푒−푥 25 Softmax Layer • It is the last layer in a neural network classifier. • The response of neuron 푖 in the softmax layer 퐿 is calculated with regard to the value of its activation function 푎푖 = 푓 푧푖 . 푒푎푖 푦ො푖 = 푔 푎푖 = 푘 : ℝ → 0,1 , 푖 = 1, . , 푘퐿 σ 퐿 푎푘 푘=1 푒 푘퐿 • The responses of softmax neurons sum up to one: σ푖=1 푦ො푖 = 1. • Better representation for mutually exclusive class labels. 31 MLP Example Fully connected • Example architecture with 퐿 = 2 layer and 푛 = 푚0 input features, 푚1 neurons at the first layer 푚 and 푚2 output units. 푙−1 푎푗 푙 = 푓푙( ෍ 푤푖푗 푙 푎푖(푙 − 1) + 푤0푗(푙)) 푖=1 Input Axons Synapses Dendrites Axons Synapses Dendrites Output 푙 = 0 1 1 푙 = 1 푤01(2) 푙 = 2 푤01(1) 1 푧 (1) 푎 (1) 푤 2 푎 (1) 푧 (2) 푦ො = 푎 (2) 푥 푤11(1)푥1 1 1 푤11(2) 11 1 1 1 1 1 푤11(1) … 푥1 푤21(1) 푧 (1) 2 푤푚11(2) … 푥2 … 푤0푚2 (2) 푥1 푤 (1) … 1푚1 푤1푚1(1)푥1 푤1푚2 (2) 푤1푚2 2 푎1(1) … 푧 (1) 푧푚 (2) 푦ො푚 = 푎푚 (2) 푥푛 푚1 푎푚1(1) … 2 2 2 푥푛0 푤푛푚1 (1) 푤푚1푚2 (2) 32 MLP Training • Mean Square Error (MSE): 1 퐽 훉 = 퐽 퐖, 퐛 = σ푁 퐲ෝ − 퐲 2. 푁 푖=1 푖 푖 • It is suitable for regression and classification. • Categorical Cross Entropy Error: 푁 푚 퐽퐶퐶퐸 = − σ푖=1 σj=1 푦푖푗 log 푦ො푖푗 . • It is suitable for classifiers that use softmax output layers. • Exponential Loss: 푇 푁 −훽퐲 퐲ෝ푖 퐽(훉) = σ푖=1 푒 푖 . 37 MLP Training 휕퐽 휕퐽 • Differentiation: = = ퟎ can provide the critical points: 휕퐖 휕퐛 • Minima, maxima, saddle points. • Analytical computations of this kind are usually impossible. • Need to resort to numerical optimization methods. 38 MLP Training • Iteratively search the parameter space for the optimal values. In each iteration, apply a small correction to the previous value. 훉 푡 + 1 = 훉 푡 + ∆훉. • In the neural network case, jointly optimize 퐖, 퐛 퐖 푡 + 1 = 퐖 푡 + ∆퐖, 퐛 푡 + 1 = 퐛 푡 + ∆퐛. • What is an appropriate choice for the correction term? 38 MLP Training Steepest descent on a function surface. 39 MLP Training • Gradient Descent, as applied to Artificial Neural Networks: • Equations for the 푗푡ℎ neuron in the 푙푡ℎ layer: 40 Backpropagation ⚫ We can now state the four essential equations for the implementation of the backpropagation algorithm: (BP1) (BP2) (BP3) (BP4) 45 Backpropagation Algorithm 퐀퐥퐠퐨퐫퐢퐭퐡퐦: Neural Network Training with Gradient Descent and Back propagation 푁 Input: learning rate 휇, dataset { 퐱푖, 퐲푖 }푖=1 퐑퐚퐧퐝퐨퐦 퐢퐧퐢퐭퐢퐚퐥퐢퐳퐚퐭퐢퐨퐧 퐨퐟 퐧퐞퐭퐰퐨퐫퐤 퐩퐚퐫퐚퐦퐞퐭퐞퐫퐬 (퐖, 퐛) 1 퐃퐞퐟퐢퐧퐞 퐜퐨퐬퐭 퐟퐮퐧퐜퐭퐢퐨퐧 퐽 = σ푁 퐲 − 퐲ො 2 푁 푖=1 푖 푖 퐰퐡퐢퐥퐞 표푝푡푖푚푖푧푎푡푖표푛 푐푟푖푡푒푟푖푎 푛표푡 푚푒푡 퐝퐨 퐽 = 0 퐅퐨퐫퐰퐚퐫퐝 퐩퐚퐬퐬 퐨퐟 퐰퐡퐨퐥퐞 퐛퐚퐭퐜퐡 퐲ො = 퐟푁푁 퐱; 퐖, 퐛 푁 1 퐽 = ෍ 퐲 − 퐲ො 2 푁 푖 푖 푖=1 퐿 푑푓 푧푗 훅 퐿 = ∇ 퐽 ∘ 푎 퐿 푑푧푗 퐟퐨퐫 푙 = 1, … , 퐿 − 1 퐝퐨 푑푓 퐳 푙 훅 푙 = 퐰 푙+1 훅 푙+1 ∘ 푑퐳 푙 휕퐽 = 훿 푙 푎 푙−1 푙 푗 푘 휕푤푗푘 휕퐽 = 훿 푙 푙 푗 휕푏푗 퐞퐧퐝 퐖 ← 퐖 − 휇∇퐰퐽(퐖, 퐛) 퐛 ← 퐛 − 휇∇퐛퐽(퐖, 퐛) 퐞퐧퐝 46 Optimization Review ⚫ Batch Gradient Descent problems: 3) Ill conditioning: Takes into account only first – order information (1푠푡 derivative). No information about curvature of cost function may lead to zig-zagging motions during optimization, resulting in very slow convergence. 4) May depend too strongly on the initialization of the parameters. 5) Learning rate must be chosen very carefully. 48 Optimization Review ⚫ Stochastic Gradient Descent (SGD): 1. Passing the whole dataset through the network just for a small update in the parameters is inefficient. 2.Most cost functions used in Machine Learning and Deep Learning applications can be decomposed into a sum over training samples. Instead of computing the gradient through the whole training set, we decompose the training set {(퐱푖, 퐲푖), 푖 = 1, … 푁} to 퐵 mini-batches (possibly same cardinality 푆) to estimate 훻휃퐽 from 훻휃퐽푏, 푏 = 1, … , 퐵: 푆 푁 푆 σ푠=1 훻휃퐽퐱 σ 훻휃퐽퐱 1 푠 ≈ 푖=1 푖 = 훻 퐽 ⇒ 훻 퐽 ≈ ෍ 훻 퐽 푆 푁 휃 휃 푆 휃 퐱푠 푠=1 49 Optimization Review • Adaptive Learning Rate Algorithms: • Up to this point, the learning rate was given as an input to the algorithm and was kept stable throughout the whole optimization procedure. • To overcome the difficulties of using a set rate, adaptive rate algorithms assign a different learning rate for each separate model parameter. 51 Optimization Review • Adaptive Learning Rate Algorithms: 1.AdaGrad algorithm: i. Gradient accumulation variable 퐫 = ퟎ 1 ii.Compute and accumulate gradient 퐠 ← 훻 σ 퐽 푓 퐱 푖 ; 훉 푖 , 퐲 , 퐫 ← 퐫 + 퐠 ∘ 퐠 푚 휽 푖 푖 휇 iii.Compute update ∆훉 ← − ∘ 퐠 and apply update 훉 ← 훉 + ∆훉 휀+ 퐫 Parameters with largest partial derivative of the loss have a more rapid decrease in their learning rate. ○ denotes element-wise multiplication. 51 Generalization • Good performance on the training set alone is not a satisfying indicator of a good model. • We are interested in building models that can perform well on data they have not been used the training process. Bad generalization capabilities mainly occur through 2 forms: 1.Overfitting: • Overfitting can be detected when there is great discrepancy between the performance of the model in the training set, and that in any other set. It can happen due to excessive model complexity. • An example of overfitting would be trying to model a quadratic equation with a 10-th degree polynomial. 54 Generalization 2.Underfitting: • On the other hand, underfitting occurs when a model cannot accurately capture the underlying structrure of data. • Underfitting can be detected by a very low performance in the training set. • Example: trying to model a non-linear equation using linear ones. 55 Revisiting Activation Functions • To overcome the shortcomings of the sigmoid activation function, the Rectified Linear Unit (ReLU) was introduced. • Very easy to compute. • Due to its quasi-linear form leads to faster convergence (does not saturate at high values) • Units with zero derivative will not activate at all. 59 Training on Large Scale Datasets • Large number of training samples in the magnitude of hundreds of thousands. • Problem: Datasets do not fit in memory. • Solution: Using mini-batch SGD method. • Many classes, in the magnitude of hundreds up to one thousand. • Problem: Difficult to converge using MSE error. • Solution: Using Categorical Cross Entropy (CCE) loss on Softmax output. 60 Towards Deep Learning ⚫ Increasing the network depth (layer number) 퐿 can result in negligible weight updates in the first layers, because the corresponding deltas become very small or vanish completely ⚫ Problem: Vanishing gradients. ⚫ Solution: Replacing sigmoid with an activation function without an upper bound, like a rectifier (a.k.a. ramp function, ReLU).
Recommended publications
  • Malware Classification with BERT
    San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 5-25-2021 Malware Classification with BERT Joel Lawrence Alvares Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects Part of the Artificial Intelligence and Robotics Commons, and the Information Security Commons Malware Classification with Word Embeddings Generated by BERT and Word2Vec Malware Classification with BERT Presented to Department of Computer Science San José State University In Partial Fulfillment of the Requirements for the Degree By Joel Alvares May 2021 Malware Classification with Word Embeddings Generated by BERT and Word2Vec The Designated Project Committee Approves the Project Titled Malware Classification with BERT by Joel Lawrence Alvares APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE San Jose State University May 2021 Prof. Fabio Di Troia Department of Computer Science Prof. William Andreopoulos Department of Computer Science Prof. Katerina Potika Department of Computer Science 1 Malware Classification with Word Embeddings Generated by BERT and Word2Vec ABSTRACT Malware Classification is used to distinguish unique types of malware from each other. This project aims to carry out malware classification using word embeddings which are used in Natural Language Processing (NLP) to identify and evaluate the relationship between words of a sentence. Word embeddings generated by BERT and Word2Vec for malware samples to carry out multi-class classification. BERT is a transformer based pre- trained natural language processing (NLP) model which can be used for a wide range of tasks such as question answering, paraphrase generation and next sentence prediction. However, the attention mechanism of a pre-trained BERT model can also be used in malware classification by capturing information about relation between each opcode and every other opcode belonging to a malware family.
    [Show full text]
  • Backpropagation and Deep Learning in the Brain
    Backpropagation and Deep Learning in the Brain Simons Institute -- Computational Theories of the Brain 2018 Timothy Lillicrap DeepMind, UCL With: Sergey Bartunov, Adam Santoro, Jordan Guerguiev, Blake Richards, Luke Marris, Daniel Cownden, Colin Akerman, Douglas Tweed, Geoffrey Hinton The “credit assignment” problem The solution in artificial networks: backprop Credit assignment by backprop works well in practice and shows up in virtually all of the state-of-the-art supervised, unsupervised, and reinforcement learning algorithms. Why Isn’t Backprop “Biologically Plausible”? Why Isn’t Backprop “Biologically Plausible”? Neuroscience Evidence for Backprop in the Brain? A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: How to convince a neuroscientist that the cortex is learning via [something like] backprop - To convince a machine learning researcher, an appeal to variance in gradient estimates might be enough. - But this is rarely enough to convince a neuroscientist. - So what lines of argument help? How to convince a neuroscientist that the cortex is learning via [something like] backprop - What do I mean by “something like backprop”?: - That learning is achieved across multiple layers by sending information from neurons closer to the output back to “earlier” layers to help compute their synaptic updates. How to convince a neuroscientist that the cortex is learning via [something like] backprop 1. Feedback connections in cortex are ubiquitous and modify the
    [Show full text]
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018 TRAINING AUTOENCODERS BY ALTERNATING MINI- MIZATION Anonymous authors Paper under double-blind review ABSTRACT We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimiza- tion problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
    [Show full text]
  • Fun with Hyperplanes: Perceptrons, Svms, and Friends
    Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification Parallel to AIMA 18.1, 18.2, 18.6.3, 18.9 The Automatic Classification Problem Assign object/event or sequence of objects/events to one of a given finite set of categories. • Fraud detection for credit card transactions, telephone calls, etc. • Worm detection in network packets • Spam filtering in email • Recommending articles, books, movies, music • Medical diagnosis • Speech recognition • OCR of handwritten letters • Recognition of specific astronomical images • Recognition of specific DNA sequences • Financial investment Machine Learning methods provide one set of approaches to this problem CIS 391 - Intro to AI 2 Universal Machine Learning Diagram Feature Things to Magic Vector Classification be Classifier Represent- Decision classified Box ation CIS 391 - Intro to AI 3 Example: handwritten digit recognition Machine learning algorithms that Automatically cluster these images Use a training set of labeled images to learn to classify new images Discover how to account for variability in writing style CIS 391 - Intro to AI 4 A machine learning algorithm development pipeline: minimization Problem statement Given training vectors x1,…,xN and targets t1,…,tN, find… Mathematical description of a cost function Mathematical description of how to minimize/maximize the cost function Implementation r(i,k) = s(i,k) – maxj{s(i,j)+a(i,j)} … CIS 391 - Intro to AI 5 Universal Machine Learning Diagram Today: Perceptron, SVM and Friends Feature Things to Magic Vector
    [Show full text]
  • Q-Learning in Continuous State and Action Spaces
    -Learning in Continuous Q State and Action Spaces Chris Gaskett, David Wettergreen, and Alexander Zelinsky Robotic Systems Laboratory Department of Systems Engineering Research School of Information Sciences and Engineering The Australian National University Canberra, ACT 0200 Australia [cg dsw alex]@syseng.anu.edu.au j j Abstract. -learning can be used to learn a control policy that max- imises a scalarQ reward through interaction with the environment. - learning is commonly applied to problems with discrete states and ac-Q tions. We describe a method suitable for control tasks which require con- tinuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a non-holonomic control task. Advantage Learning, a variation of -learning, is shown enhance learning speed and reliability for this task.Q 1 Introduction Reinforcement learning systems learn by trial-and-error which actions are most valuable in which situations (states) [1]. Feedback is provided in the form of a scalar reward signal which may be delayed. The reward signal is defined in relation to the task to be achieved; reward is given when the system is successfully achieving the task. The value is updated incrementally with experience and is defined as a discounted sum of expected future reward. The learning systems choice of actions in response to states is called its policy. Reinforcement learning lies between the extremes of supervised learning, where the policy is taught by an expert, and unsupervised learning, where no feedback is given and the task is to find structure in data.
    [Show full text]
  • Double Backpropagation for Training Autoencoders Against Adversarial Attack
    1 Double Backpropagation for Training Autoencoders against Adversarial Attack Chengjin Sun, Sizhe Chen, and Xiaolin Huang, Senior Member, IEEE Abstract—Deep learning, as widely known, is vulnerable to adversarial samples. This paper focuses on the adversarial attack on autoencoders. Safety of the autoencoders (AEs) is important because they are widely used as a compression scheme for data storage and transmission, however, the current autoencoders are easily attacked, i.e., one can slightly modify an input but has totally different codes. The vulnerability is rooted the sensitivity of the autoencoders and to enhance the robustness, we propose to adopt double backpropagation (DBP) to secure autoencoder such as VAE and DRAW. We restrict the gradient from the reconstruction image to the original one so that the autoencoder is not sensitive to trivial perturbation produced by the adversarial attack. After smoothing the gradient by DBP, we further smooth the label by Gaussian Mixture Model (GMM), aiming for accurate and robust classification. We demonstrate in MNIST, CelebA, SVHN that our method leads to a robust autoencoder resistant to attack and a robust classifier able for image transition and immune to adversarial attack if combined with GMM. Index Terms—double backpropagation, autoencoder, network robustness, GMM. F 1 INTRODUCTION N the past few years, deep neural networks have been feature [9], [10], [11], [12], [13], or network structure [3], [14], I greatly developed and successfully used in a vast of fields, [15]. such as pattern recognition, intelligent robots, automatic Adversarial attack and its defense are revolving around a control, medicine [1]. Despite the great success, researchers small ∆x and a big resulting difference between f(x + ∆x) have found the vulnerability of deep neural networks to and f(x).
    [Show full text]
  • Introduction to Machine Learning
    Introduction to Machine Learning Perceptron Barnabás Póczos Contents History of Artificial Neural Networks Definitions: Perceptron, Multi-Layer Perceptron Perceptron algorithm 2 Short History of Artificial Neural Networks 3 Short History Progression (1943-1960) • First mathematical model of neurons ▪ Pitts & McCulloch (1943) • Beginning of artificial neural networks • Perceptron, Rosenblatt (1958) ▪ A single neuron for classification ▪ Perceptron learning rule ▪ Perceptron convergence theorem Degression (1960-1980) • Perceptron can’t even learn the XOR function • We don’t know how to train MLP • 1963 Backpropagation… but not much attention… Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550 4 Short History Progression (1980-) • 1986 Backpropagation reinvented: ▪ Rumelhart, Hinton, Williams: Learning representations by back-propagating errors. Nature, 323, 533—536, 1986 • Successful applications: ▪ Character recognition, autonomous cars,… • Open questions: Overfitting? Network structure? Neuron number? Layer number? Bad local minimum points? When to stop training? • Hopfield nets (1982), Boltzmann machines,… 5 Short History Degression (1993-) • SVM: Vapnik and his co-workers developed the Support Vector Machine (1993). It is a shallow architecture. • SVM and Graphical models almost kill the ANN research. • Training deeper networks consistently yields poor results. • Exception: deep convolutional neural networks, Yann LeCun 1998. (discriminative model) 6 Short History Progression (2006-) Deep Belief Networks (DBN) • Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554. • Generative graphical model • Based on restrictive Boltzmann machines • Can be trained efficiently Deep Autoencoder based networks Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H.
    [Show full text]
  • Audio Event Classification Using Deep Learning in an End-To-End Approach
    Audio Event Classification using Deep Learning in an End-to-End Approach Master thesis Jose Luis Diez Antich Aalborg University Copenhagen A. C. Meyers Vænge 15 2450 Copenhagen SV Denmark Title: Abstract: Audio Event Classification using Deep Learning in an End-to-End Approach The goal of the master thesis is to study the task of Sound Event Classification Participant(s): using Deep Neural Networks in an end- Jose Luis Diez Antich to-end approach. Sound Event Classifi- cation it is a multi-label classification problem of sound sources originated Supervisor(s): from everyday environments. An auto- Hendrik Purwins matic system for it would many applica- tions, for example, it could help users of hearing devices to understand their sur- Page Numbers: 38 roundings or enhance robot navigation systems. The end-to-end approach con- Date of Completion: sists in systems that learn directly from June 16, 2017 data, not from features, and it has been recently applied to audio and its results are remarkable. Even though the re- sults do not show an improvement over standard approaches, the contribution of this thesis is an exploration of deep learning architectures which can be use- ful to understand how networks process audio. The content of this report is freely available, but publication (with reference) may only be pursued due to agreement with the author. Contents 1 Introduction1 1.1 Scope of this work.............................2 2 Deep Learning3 2.1 Overview..................................3 2.2 Multilayer Perceptron...........................4
    [Show full text]
  • Unsupervised Speech Representation Learning Using Wavenet Autoencoders Jan Chorowski, Ron J
    1 Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being desired in several tasks, such as low resource automatic speech invariant to confounding low level details in the signal such as recognition (ASR), where only a small amount of labeled the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the to learn the acoustic model and a data representation in a fully kind of constraint that is applied to the latent representation. supervised manner [15], [16]. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately re- [17].
    [Show full text]
  • Approaching Hanabi with Q-Learning and Evolutionary Algorithm
    St. Cloud State University theRepository at St. Cloud State Culminating Projects in Computer Science and Department of Computer Science and Information Technology Information Technology 12-2020 Approaching Hanabi with Q-Learning and Evolutionary Algorithm Joseph Palmersten [email protected] Follow this and additional works at: https://repository.stcloudstate.edu/csit_etds Part of the Computer Sciences Commons Recommended Citation Palmersten, Joseph, "Approaching Hanabi with Q-Learning and Evolutionary Algorithm" (2020). Culminating Projects in Computer Science and Information Technology. 34. https://repository.stcloudstate.edu/csit_etds/34 This Starred Paper is brought to you for free and open access by the Department of Computer Science and Information Technology at theRepository at St. Cloud State. It has been accepted for inclusion in Culminating Projects in Computer Science and Information Technology by an authorized administrator of theRepository at St. Cloud State. For more information, please contact [email protected]. Approaching Hanabi with Q-Learning and Evolutionary Algorithm by Joseph A Palmersten A Starred Paper Submitted to the Graduate Faculty of St. Cloud State University In Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science December, 2020 Starred Paper Committee: Bryant Julstrom, Chairperson Donald Hamnes Jie Meichsner 2 Abstract Hanabi is a cooperative card game with hidden information that requires cooperation and communication between the players. For a machine learning agent to be successful at the Hanabi, it will have to learn how to communicate and infer information from the communication of other players. To approach the problem of Hanabi the machine learning methods of Q- learning and Evolutionary algorithm are proposed as potential solutions.
    [Show full text]
  • 4 Nonlinear Regression and Multilayer Perceptron
    4 Nonlinear Regression and Multilayer Perceptron In nonlinear regression the output variable y is no longer a linear function of the regression parameters plus additive noise. This means that estimation of the parameters is harder. It does not reduce to minimizing a convex energy functions { unlike the methods we described earlier. The perceptron is an analogy to the neural networks in the brain (over-simplified). It Pd receives a set of inputs y = j=1 !jxj + !0, see Figure (3). Figure 3: Idealized neuron implementing a perceptron. It has a threshold function which can be hard or soft. The hard one is ζ(a) = 1; if a > 0, ζ(a) = 0; otherwise. The soft one is y = σ(~!T ~x) = 1=(1 + e~!T ~x), where σ(·) is the sigmoid function. There are a variety of different algorithms to train a perceptron from labeled examples. Example: The quadratic error: t t 1 t t 2 E(~!j~x ; y ) = 2 (y − ~! · ~x ) ; for which the update rule is ∆!t = −∆ @E = +∆(yt~! · ~xt)~xt. Introducing the sigmoid j @!j function rt = sigmoid(~!T ~xt), we have t t P t t t t E(~!j~x ; y ) = − i ri log yi + (1 − ri) log(1 − yi ) , and the update rule is t t t t ∆!j = −η(r − y )xj, where η is the learning factor. I.e, the update rule is the learning factor × (desired output { actual output)× input. 10 4.1 Multilayer Perceptrons Multilayer perceptrons were developed to address the limitations of perceptrons (introduced in subsection 2.1) { i.e.
    [Show full text]
  • 4 Perceptron Learning
    4 Perceptron Learning 4.1 Learning algorithms for neural networks In the two preceding chapters we discussed two closely related models, McCulloch–Pitts units and perceptrons, but the question of how to find the parameters adequate for a given task was left open. If two sets of points have to be separated linearly with a perceptron, adequate weights for the comput- ing unit must be found. The operators that we used in the preceding chapter, for example for edge detection, used hand customized weights. Now we would like to find those parameters automatically. The perceptron learning algorithm deals with this problem. A learning algorithm is an adaptive method by which a network of com- puting units self-organizes to implement the desired behavior. This is done in some learning algorithms by presenting some examples of the desired input- output mapping to the network. A correction step is executed iteratively until the network learns to produce the desired response. The learning algorithm is a closed loop of presentation of examples and of corrections to the network parameters, as shown in Figure 4.1. network test input-output compute the examples error fix network parameters Fig. 4.1. Learning process in a parametric system R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996 78 4 Perceptron Learning In some simple cases the weights for the computing units can be found through a sequential test of stochastically generated numerical combinations. However, such algorithms which look blindly for a solution do not qualify as “learning”. A learning algorithm must adapt the network parameters accord- ing to previous experience until a solution is found, if it exists.
    [Show full text]