Machine Learning I Lecture 18 Multi-Layer Perceptron

Total Page:16

File Type:pdf, Size:1020Kb

Machine Learning I Lecture 18 Multi-Layer Perceptron ECE595 / STAT598: Machine Learning I Lecture 18 Multi-Layer Perceptron Spring 2020 Stanley Chan School of Electrical and Computer Engineering Purdue University c Stanley Chan 2020. All Rights Reserved. 1 / 28 Outline Discriminative Approaches Lecture 16 Perceptron 1: Definition and Basic Concepts Lecture 17 Perceptron 2: Algorithm and Property Lecture 18 Multi-Layer Perceptron: Back Propagation This lecture: Multi-Layer Perceptron: Back Propagation Multi-Layer Perceptron Hidden Layer Matrix Representation Back Propagation Chain Rule 4 Fundamental Equations Algorithm Interpretation c Stanley Chan 2020. All Rights Reserved. 2 / 28 Single-Layer Perceptron Input neurons x Weights w T Predicted label = σ(w x + w0). c Stanley Chan 2020. All Rights Reserved. 3 / 28 Multi-Layer Network https://towardsdatascience.com/ multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f Introduce a layer of hidden neurons So now you have two sets of weights: from input to hidden, and from hidden to output c Stanley Chan 2020. All Rights Reserved. 4 / 28 Many Hidden Layers You can introduce as many hidden layers as you want. Every time you add a hidden layer, you add a set of weights. c Stanley Chan 2020. All Rights Reserved. 5 / 28 Understanding the Weights Each hidden neuron is an output of a perceptron So you will have 2 1 1 1 3 2 3 2 1 3 w w ::: w x1 h1 11 12 1n 1 1 1 1 h 6w21 w22 ::: w2n 7 6x2 7 6 2 7 = 6 7 6 7 6 7 6 . .. 7 6 . 7 4:::5 4 . 5 4 . 5 h1 1 1 1 n wm1 wm2 ::: wmn xm c Stanley Chan 2020. All Rights Reserved. 6 / 28 Progression to DEEP (Linear) Neural Networks Single-layer: h = w T x Hidden-layer: h = W T x Two Hidden Layers: T T h = W 2 W 1 x Three Hidden Layers: T T T h = W 3 W 2 W 1 x A LOT of Hidden Layers: T T T h = W N ::: W 2 W 1 x c Stanley Chan 2020. All Rights Reserved. 7 / 28 Interpreting the Hidden Layer Each hidden neuron is responsible for certain features. Given an object, the network identifies the most likely features. c Stanley Chan 2020. All Rights Reserved. 8 / 28 Interpreting the Hidden Layer https://www.scientificamerican.com/article/springtime-for-ai-the-rise-of-deep-learning/ c Stanley Chan 2020. All Rights Reserved. 9 / 28 Two Questions about Multi-Layer Network How do we efficiently learn the weights? Ultimately we need to minimize the loss N X T T T 2 J(W 1;:::; W L) = kW L ::: W 2 W 1 x i − y i k i=1 One layer: Gradient descent. Multi-layer: Also gradient descent, also known as Back propagation (BP) by Rumelhart, Hinton and Williams (1986) Back propagation = Very careful book-keeping and chain rule What is the optimization landscape? Convex? Global minimum? Saddle point? Two-layer case is proved by Baldi and Hornik (1989) All local minima are global. A critical point is either a saddle point or global minimum. L-layer case is proved by Kawaguchi (2016). Also proved L-layer nonlinear network (with sigmoid between adjacent layers.) c Stanley Chan 2020. All Rights Reserved. 10 / 28 Outline Discriminative Approaches Lecture 16 Perceptron 1: Definition and Basic Concepts Lecture 17 Perceptron 2: Algorithm and Property Lecture 18 Multi-Layer Perceptron: Back Propagation This lecture: Multi-Layer Perceptron: Back Propagation Multi-Layer Perceptron Hidden Layer Matrix Representation Back Propagation Chain Rule 4 Fundamental Equations Algorithm Interpretation c Stanley Chan 2020. All Rights Reserved. 11 / 28 Back Propagation: A 20-Minute Tour You will be able to find A LOT OF blogs on the internet discussing how back propagation is being implemented. Some are mystifying back propagation Some literally just teach you the procedure of back propagation without telling you the intuition I find the following online book by Mike Nielsen fairly well-written http://neuralnetworksanddeeplearning.com/ The following slides are written based on Nielsen's book We will not go into great details The purpose to get you exposed to the idea, and de-mystify back propagation As stated before, back propagation is chain rule + very careful book keeping c Stanley Chan 2020. All Rights Reserved. 12 / 28 Back Propagation Here is the loss function you want to minimize: N X T T T 2 J(W 1;:::; W L) = kσ(W L : : : σ(W 2 σ(W 1 xi ))) − y i k i=1 You have a set of nonlinear activation functions, usually the sigmoid. To optimize, you need gradient descent. For example, for W 1 t+1 t t W 1 = W 1 − αrJ(W 1) But you need to do this for all W 1;:::; W L. And there are lots of sigmoid functions. Let us do the brute force. And this is back-propagation. (Really? Yes...) c Stanley Chan 2020. All Rights Reserved. 13 / 28 Let us See an Example Let us look at two layers T T 2 J(W 1; W 2) = kσ(W 2 σ(W 1 x)) − yk | {z } a2 Let us go backward: @J @J @a = · 2 @W 2 @a2 @W 2 Now, what is a2? T T a2 = σ(W 2 σ(W 1 x)) | {z } z2 So let us compute: @a @a @z 2 = 2 · 2 : @W 2 @z2 @W 2 c Stanley Chan 2020. All Rights Reserved. 14 / 28 Let us See an Example T T 2 J(W 1; W 2) = kσ(W 2 σ(W 1 x)) − yk | {z } a1 How about W 1? Again, let us go backward: @J @J @a = · 2 @W 1 @a2 @W 1 T But you can now repeat the calculation as follows (Let z1 = W 1 x) @a @a @a 2 = 2 1 @W 1 @a1 @W 1 @a @a @z = 2 1 1 @a1 @z1 W 1 So it is just a very long sequence of chain rule. c Stanley Chan 2020. All Rights Reserved. 15 / 28 Notations for Back Propagation The following notations are based on Nielsen's online book. The purpose of doing these is to write down a concise algorithm. Weights: 3 w24: The 3rd layer 3 w24: From 4-th neuron to 2-nd neuron c Stanley Chan 2020. All Rights Reserved. 16 / 28 Notations for Back Propagation Activation and Bias: 3 a1: 3rd layer, 1st activation 2 b3: 2nd layer, 3rd bias T Here is the relationship. Think of σ(w x + w0): ! ` X ` `−1 ` aj = σ wjk ak + bj : k c Stanley Chan 2020. All Rights Reserved. 17 / 28 Understanding Back Propagation This is the main equation ! ` X ` `−1 ` ` ` aj = σ wjk ak + bj ; or aj = σ(zj ): k | {z } ` zj ` ` aj : activation, zj : intermediate. c Stanley Chan 2020. All Rights Reserved. 18 / 28 Loss The loss takes the form of X L 2 C = (aj − yj ) j Think of two-class cross-entropy where each aL is a 2-by-1 vector c Stanley Chan 2020. All Rights Reserved. 19 / 28 Error Term The error is defined as ` @C δj = ` @zj You can show that at the output, L L @C @aj @C 0 L δj = L L = L σ (zj ): @aj @zj @aj c Stanley Chan 2020. All Rights Reserved. 20 / 28 4 Fundamental Equations for Back Propagation BP Equation 1: For the error in the output layer: L @C 0 L δj = L σ (zj ): (BP-1) @aj @C L First term: L is rate of change w.r.t. aj @aj 0 L L Second term: σ (zj ) = rate of change w.r.t. zj . So it is just chain rule. 1 P L 2 Example: If C = 2 j (yj − aj ) , then @C L L = (aj − yj ) @aj L 0 L Matrix-vector form: δ = raC σ (z ) c Stanley Chan 2020. All Rights Reserved. 21 / 28 4 Fundamental Equations for Back Propagation BP Equation 2: An equation for the error δ` in terms of the error in the next layer, δ`+1 δ` = ((w `+1)T δ`+1) σ0(z`): (BP-2) You start with δ`+1. Take weighted average w `+1. (BP-1) and (BP-2) can help you determine error at any layer. c Stanley Chan 2020. All Rights Reserved. 22 / 28 4 Fundamental Equations for Back Propagation Equation 3: An equation for the rate of change of the cost with respect to any bias in the network. @C ` ` = δj : (BP-3) @bj ` Good news: We have already known δj from Equation 1na dn 2. @C So computing ` is easy. @bj Equation 4: An equation for the rate of change of the cost with respect to any weight in the network. @C `−1 ` ` = ak δj (BP-4) @wjk Again, everything on the right is known. So it is easy to compute. c Stanley Chan 2020. All Rights Reserved. 23 / 28 Back Propagation Algorithm Below is a very concise summary of the BP algorithm c Stanley Chan 2020. All Rights Reserved. 24 / 28 Step 2: Feed Forward Step Let us take a closer look at Step 2 The feed forward step computes the intermediate variables and the activations z` = (w `)T a`−1 + b` a` = σ(z`): c Stanley Chan 2020. All Rights Reserved. 25 / 28 Step 3: Output Error Let us take a closer look at Step 3 The output error is given by (BP-1) L 0 L δ = raC σ (z ) c Stanley Chan 2020. All Rights Reserved. 26 / 28 Step 4: Output Error Let us take a closer look at Step 4 The error back propagation is given by (BP-2) δ` = ((w `+1)T δ`+1) σ0(z`): c Stanley Chan 2020.
Recommended publications
  • Malware Classification with BERT
    San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 5-25-2021 Malware Classification with BERT Joel Lawrence Alvares Follow this and additional works at: https://scholarworks.sjsu.edu/etd_projects Part of the Artificial Intelligence and Robotics Commons, and the Information Security Commons Malware Classification with Word Embeddings Generated by BERT and Word2Vec Malware Classification with BERT Presented to Department of Computer Science San José State University In Partial Fulfillment of the Requirements for the Degree By Joel Alvares May 2021 Malware Classification with Word Embeddings Generated by BERT and Word2Vec The Designated Project Committee Approves the Project Titled Malware Classification with BERT by Joel Lawrence Alvares APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE San Jose State University May 2021 Prof. Fabio Di Troia Department of Computer Science Prof. William Andreopoulos Department of Computer Science Prof. Katerina Potika Department of Computer Science 1 Malware Classification with Word Embeddings Generated by BERT and Word2Vec ABSTRACT Malware Classification is used to distinguish unique types of malware from each other. This project aims to carry out malware classification using word embeddings which are used in Natural Language Processing (NLP) to identify and evaluate the relationship between words of a sentence. Word embeddings generated by BERT and Word2Vec for malware samples to carry out multi-class classification. BERT is a transformer based pre- trained natural language processing (NLP) model which can be used for a wide range of tasks such as question answering, paraphrase generation and next sentence prediction. However, the attention mechanism of a pre-trained BERT model can also be used in malware classification by capturing information about relation between each opcode and every other opcode belonging to a malware family.
    [Show full text]
  • Backpropagation and Deep Learning in the Brain
    Backpropagation and Deep Learning in the Brain Simons Institute -- Computational Theories of the Brain 2018 Timothy Lillicrap DeepMind, UCL With: Sergey Bartunov, Adam Santoro, Jordan Guerguiev, Blake Richards, Luke Marris, Daniel Cownden, Colin Akerman, Douglas Tweed, Geoffrey Hinton The “credit assignment” problem The solution in artificial networks: backprop Credit assignment by backprop works well in practice and shows up in virtually all of the state-of-the-art supervised, unsupervised, and reinforcement learning algorithms. Why Isn’t Backprop “Biologically Plausible”? Why Isn’t Backprop “Biologically Plausible”? Neuroscience Evidence for Backprop in the Brain? A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: How to convince a neuroscientist that the cortex is learning via [something like] backprop - To convince a machine learning researcher, an appeal to variance in gradient estimates might be enough. - But this is rarely enough to convince a neuroscientist. - So what lines of argument help? How to convince a neuroscientist that the cortex is learning via [something like] backprop - What do I mean by “something like backprop”?: - That learning is achieved across multiple layers by sending information from neurons closer to the output back to “earlier” layers to help compute their synaptic updates. How to convince a neuroscientist that the cortex is learning via [something like] backprop 1. Feedback connections in cortex are ubiquitous and modify the
    [Show full text]
  • Implementing Machine Learning & Neural Network Chip
    Implementing Machine Learning & Neural Network Chip Architectures Using Network-on-Chip Interconnect IP Implementing Machine Learning & Neural Network Chip Architectures USING NETWORK-ON-CHIP INTERCONNECT IP TY GARIBAY Chief Technology Officer [email protected] Title: Implementing Machine Learning & Neural Network Chip Architectures Using Network-on-Chip Interconnect IP Primary author: Ty Garibay, Chief Technology Officer, ArterisIP, [email protected] Secondary author: Kurt Shuler, Vice President, Marketing, ArterisIP, [email protected] Abstract: A short tutorial and overview of machine learning and neural network chip architectures, with emphasis on how network-on-chip interconnect IP implements these architectures. Time to read: 10 minutes Time to present: 30 minutes Copyright © 2017 Arteris www.arteris.com 1 Implementing Machine Learning & Neural Network Chip Architectures Using Network-on-Chip Interconnect IP What is Machine Learning? “ Field of study that gives computers the ability to learn without being explicitly programmed.” Arthur Samuel, IBM, 1959 • Machine learning is a subset of Artificial Intelligence Copyright © 2017 Arteris 2 • Machine learning is a subset of artificial intelligence, and explicitly relies upon experiential learning rather than programming to make decisions. • The advantage to machine learning for tasks like automated driving or speech translation is that complex tasks like these are nearly impossible to explicitly program using rule-based if…then…else statements because of the large solution space. • However, the “answers” given by a machine learning algorithm have a probability associated with them and can be non-deterministic (meaning you can get different “answers” given the same inputs during different runs) • Neural networks have become the most common way to implement machine learning.
    [Show full text]
  • Training Autoencoders by Alternating Minimization
    Under review as a conference paper at ICLR 2018 TRAINING AUTOENCODERS BY ALTERNATING MINI- MIZATION Anonymous authors Paper under double-blind review ABSTRACT We present DANTE, a novel method for training neural networks, in particular autoencoders, using the alternating minimization principle. DANTE provides a distinct perspective in lieu of traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convex optimization techniques to cast autoencoder training as a bi-quasi-convex optimiza- tion problem. We show that for autoencoder configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations very effectively. DANTE effortlessly extends to networks with multiple hidden layers and varying network configurations. In experiments on standard datasets, autoencoders trained using the proposed method were found to be very promising and competitive to traditional backpropagation techniques, both in terms of quality of solution, as well as training speed. 1 INTRODUCTION For much of the recent march of deep learning, gradient-based backpropagation methods, e.g. Stochastic Gradient Descent (SGD) and its variants, have been the mainstay of practitioners. The use of these methods, especially on vast amounts of data, has led to unprecedented progress in several areas of artificial intelligence. On one hand, the intense focus on these techniques has led to an intimate understanding of hardware requirements and code optimizations needed to execute these routines on large datasets in a scalable manner. Today, myriad off-the-shelf and highly optimized packages exist that can churn reasonably large datasets on GPU architectures with relatively mild human involvement and little bootstrap effort.
    [Show full text]
  • Predrnn: Recurrent Neural Networks for Predictive Learning Using Spatiotemporal Lstms
    PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs Yunbo Wang Mingsheng Long∗ School of Software School of Software Tsinghua University Tsinghua University [email protected] [email protected] Jianmin Wang Zhifeng Gao Philip S. Yu School of Software School of Software School of Software Tsinghua University Tsinghua University Tsinghua University [email protected] [email protected] [email protected] Abstract The predictive learning of spatiotemporal sequences aims to generate future images by learning from the historical frames, where spatial appearances and temporal vari- ations are two crucial structures. This paper models these structures by presenting a predictive recurrent neural network (PredRNN). This architecture is enlightened by the idea that spatiotemporal predictive learning should memorize both spatial ap- pearances and temporal variations in a unified memory pool. Concretely, memory states are no longer constrained inside each LSTM unit. Instead, they are allowed to zigzag in two directions: across stacked RNN layers vertically and through all RNN states horizontally. The core of this network is a new Spatiotemporal LSTM (ST-LSTM) unit that extracts and memorizes spatial and temporal representations simultaneously. PredRNN achieves the state-of-the-art prediction performance on three video prediction datasets and is a more general framework, that can be easily extended to other predictive learning tasks by integrating with other architectures. 1 Introduction
    [Show full text]
  • Fun with Hyperplanes: Perceptrons, Svms, and Friends
    Perceptrons, SVMs, and Friends: Some Discriminative Models for Classification Parallel to AIMA 18.1, 18.2, 18.6.3, 18.9 The Automatic Classification Problem Assign object/event or sequence of objects/events to one of a given finite set of categories. • Fraud detection for credit card transactions, telephone calls, etc. • Worm detection in network packets • Spam filtering in email • Recommending articles, books, movies, music • Medical diagnosis • Speech recognition • OCR of handwritten letters • Recognition of specific astronomical images • Recognition of specific DNA sequences • Financial investment Machine Learning methods provide one set of approaches to this problem CIS 391 - Intro to AI 2 Universal Machine Learning Diagram Feature Things to Magic Vector Classification be Classifier Represent- Decision classified Box ation CIS 391 - Intro to AI 3 Example: handwritten digit recognition Machine learning algorithms that Automatically cluster these images Use a training set of labeled images to learn to classify new images Discover how to account for variability in writing style CIS 391 - Intro to AI 4 A machine learning algorithm development pipeline: minimization Problem statement Given training vectors x1,…,xN and targets t1,…,tN, find… Mathematical description of a cost function Mathematical description of how to minimize/maximize the cost function Implementation r(i,k) = s(i,k) – maxj{s(i,j)+a(i,j)} … CIS 391 - Intro to AI 5 Universal Machine Learning Diagram Today: Perceptron, SVM and Friends Feature Things to Magic Vector
    [Show full text]
  • Q-Learning in Continuous State and Action Spaces
    -Learning in Continuous Q State and Action Spaces Chris Gaskett, David Wettergreen, and Alexander Zelinsky Robotic Systems Laboratory Department of Systems Engineering Research School of Information Sciences and Engineering The Australian National University Canberra, ACT 0200 Australia [cg dsw alex]@syseng.anu.edu.au j j Abstract. -learning can be used to learn a control policy that max- imises a scalarQ reward through interaction with the environment. - learning is commonly applied to problems with discrete states and ac-Q tions. We describe a method suitable for control tasks which require con- tinuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a non-holonomic control task. Advantage Learning, a variation of -learning, is shown enhance learning speed and reliability for this task.Q 1 Introduction Reinforcement learning systems learn by trial-and-error which actions are most valuable in which situations (states) [1]. Feedback is provided in the form of a scalar reward signal which may be delayed. The reward signal is defined in relation to the task to be achieved; reward is given when the system is successfully achieving the task. The value is updated incrementally with experience and is defined as a discounted sum of expected future reward. The learning systems choice of actions in response to states is called its policy. Reinforcement learning lies between the extremes of supervised learning, where the policy is taught by an expert, and unsupervised learning, where no feedback is given and the task is to find structure in data.
    [Show full text]
  • Double Backpropagation for Training Autoencoders Against Adversarial Attack
    1 Double Backpropagation for Training Autoencoders against Adversarial Attack Chengjin Sun, Sizhe Chen, and Xiaolin Huang, Senior Member, IEEE Abstract—Deep learning, as widely known, is vulnerable to adversarial samples. This paper focuses on the adversarial attack on autoencoders. Safety of the autoencoders (AEs) is important because they are widely used as a compression scheme for data storage and transmission, however, the current autoencoders are easily attacked, i.e., one can slightly modify an input but has totally different codes. The vulnerability is rooted the sensitivity of the autoencoders and to enhance the robustness, we propose to adopt double backpropagation (DBP) to secure autoencoder such as VAE and DRAW. We restrict the gradient from the reconstruction image to the original one so that the autoencoder is not sensitive to trivial perturbation produced by the adversarial attack. After smoothing the gradient by DBP, we further smooth the label by Gaussian Mixture Model (GMM), aiming for accurate and robust classification. We demonstrate in MNIST, CelebA, SVHN that our method leads to a robust autoencoder resistant to attack and a robust classifier able for image transition and immune to adversarial attack if combined with GMM. Index Terms—double backpropagation, autoencoder, network robustness, GMM. F 1 INTRODUCTION N the past few years, deep neural networks have been feature [9], [10], [11], [12], [13], or network structure [3], [14], I greatly developed and successfully used in a vast of fields, [15]. such as pattern recognition, intelligent robots, automatic Adversarial attack and its defense are revolving around a control, medicine [1]. Despite the great success, researchers small ∆x and a big resulting difference between f(x + ∆x) have found the vulnerability of deep neural networks to and f(x).
    [Show full text]
  • Introduction to Machine Learning
    Introduction to Machine Learning Perceptron Barnabás Póczos Contents History of Artificial Neural Networks Definitions: Perceptron, Multi-Layer Perceptron Perceptron algorithm 2 Short History of Artificial Neural Networks 3 Short History Progression (1943-1960) • First mathematical model of neurons ▪ Pitts & McCulloch (1943) • Beginning of artificial neural networks • Perceptron, Rosenblatt (1958) ▪ A single neuron for classification ▪ Perceptron learning rule ▪ Perceptron convergence theorem Degression (1960-1980) • Perceptron can’t even learn the XOR function • We don’t know how to train MLP • 1963 Backpropagation… but not much attention… Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550 4 Short History Progression (1980-) • 1986 Backpropagation reinvented: ▪ Rumelhart, Hinton, Williams: Learning representations by back-propagating errors. Nature, 323, 533—536, 1986 • Successful applications: ▪ Character recognition, autonomous cars,… • Open questions: Overfitting? Network structure? Neuron number? Layer number? Bad local minimum points? When to stop training? • Hopfield nets (1982), Boltzmann machines,… 5 Short History Degression (1993-) • SVM: Vapnik and his co-workers developed the Support Vector Machine (1993). It is a shallow architecture. • SVM and Graphical models almost kill the ANN research. • Training deeper networks consistently yields poor results. • Exception: deep convolutional neural networks, Yann LeCun 1998. (discriminative model) 6 Short History Progression (2006-) Deep Belief Networks (DBN) • Hinton, G. E, Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554. • Generative graphical model • Based on restrictive Boltzmann machines • Can be trained efficiently Deep Autoencoder based networks Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H.
    [Show full text]
  • Audio Event Classification Using Deep Learning in an End-To-End Approach
    Audio Event Classification using Deep Learning in an End-to-End Approach Master thesis Jose Luis Diez Antich Aalborg University Copenhagen A. C. Meyers Vænge 15 2450 Copenhagen SV Denmark Title: Abstract: Audio Event Classification using Deep Learning in an End-to-End Approach The goal of the master thesis is to study the task of Sound Event Classification Participant(s): using Deep Neural Networks in an end- Jose Luis Diez Antich to-end approach. Sound Event Classifi- cation it is a multi-label classification problem of sound sources originated Supervisor(s): from everyday environments. An auto- Hendrik Purwins matic system for it would many applica- tions, for example, it could help users of hearing devices to understand their sur- Page Numbers: 38 roundings or enhance robot navigation systems. The end-to-end approach con- Date of Completion: sists in systems that learn directly from June 16, 2017 data, not from features, and it has been recently applied to audio and its results are remarkable. Even though the re- sults do not show an improvement over standard approaches, the contribution of this thesis is an exploration of deep learning architectures which can be use- ful to understand how networks process audio. The content of this report is freely available, but publication (with reference) may only be pursued due to agreement with the author. Contents 1 Introduction1 1.1 Scope of this work.............................2 2 Deep Learning3 2.1 Overview..................................3 2.2 Multilayer Perceptron...........................4
    [Show full text]
  • Deep Learning Architectures for Sequence Processing
    Speech and Language Processing. Daniel Jurafsky & James H. Martin. Copyright © 2021. All rights reserved. Draft of September 21, 2021. CHAPTER Deep Learning Architectures 9 for Sequence Processing Time will explain. Jane Austen, Persuasion Language is an inherently temporal phenomenon. Spoken language is a sequence of acoustic events over time, and we comprehend and produce both spoken and written language as a continuous input stream. The temporal nature of language is reflected in the metaphors we use; we talk of the flow of conversations, news feeds, and twitter streams, all of which emphasize that language is a sequence that unfolds in time. This temporal nature is reflected in some of the algorithms we use to process lan- guage. For example, the Viterbi algorithm applied to HMM part-of-speech tagging, proceeds through the input a word at a time, carrying forward information gleaned along the way. Yet other machine learning approaches, like those we’ve studied for sentiment analysis or other text classification tasks don’t have this temporal nature – they assume simultaneous access to all aspects of their input. The feedforward networks of Chapter 7 also assumed simultaneous access, al- though they also had a simple model for time. Recall that we applied feedforward networks to language modeling by having them look only at a fixed-size window of words, and then sliding this window over the input, making independent predictions along the way. Fig. 9.1, reproduced from Chapter 7, shows a neural language model with window size 3 predicting what word follows the input for all the. Subsequent words are predicted by sliding the window forward a word at a time.
    [Show full text]
  • Unsupervised Speech Representation Learning Using Wavenet Autoencoders Jan Chorowski, Ron J
    1 Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being desired in several tasks, such as low resource automatic speech invariant to confounding low level details in the signal such as recognition (ASR), where only a small amount of labeled the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the to learn the acoustic model and a data representation in a fully kind of constraint that is applied to the latent representation. supervised manner [15], [16]. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately re- [17].
    [Show full text]