Advanced ML: Unsupervised Learning with Autoregressive and Latent Variable Models


Advanced ML: Unsupervised learning with autoregressive and latent variable models
https://github.com/janchorowski/JSALT2019_tutorials
Jan Chorowski, University of Wrocław
With slides from TMLSS 2018 (Jan Chorowski, Ulrich Paquet)

Outline
• Why unsupervised learning
• Autoregressive models
  – Intuitions
  – Dilated convolutions
  – RNNs (intermezzo: LSTM training dynamics)
  – Transformers
  – Beyond sequences, summary
• Latent variable models
  – VAE
  – Normalizing flows
• Autoregressive + latent variable: why and how?

Deep Model = Hierarchy of Concepts
Cat, Dog, …, Moon, Banana
M. Zeiler, "Visualizing and Understanding Convolutional Networks"

Deep Learning history: 1986
"'Hidden' units which are not part of the input or output come to represent important features of the task domain."

Deep Learning history: 2006
Stacked RBMs
Hinton, Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks"

Deep Learning history: 2012
2012: AlexNet achieves SOTA on ImageNet with fully supervised training.

Deep Learning Recipe
1. Get a massive, labeled dataset D = {(x, y)}:
   – Computer vision: ImageNet, 1M images
   – Machine translation: Europarl data, CommonCrawl, several million sentence pairs
   – Speech recognition: 1000 h (LibriSpeech), 12,000 h (Google Voice Search)
   – Question answering: SQuAD, 150k questions with human answers
   – …
2. Train the model to maximize log p(y|x)

Value of Labeled Data
• Labeled data is crucial for deep learning
• But labels carry little information:
  – Example: an ImageNet model has 30M weights, but ImageNet is about 1M images from 1000 classes.
    Labels: 1M × 10 bits = 10 Mbits
    Raw data (128 × 128 images): ca. 500 Gbits!

Value of Unlabeled Data
"The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second." — Geoff Hinton
https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/

Unsupervised learning recipe
1. Get a massive unlabeled dataset D = {x}. Easy, unlabeled data is nearly free.
2. Train the model to… ??? What is the task? What is the loss function?

Unsupervised learning goals
• Learn a data representation:
  – Extract features for downstream tasks
  – Describe the data (clusterings)
• Data generation / density estimation:
  – What are my data
  – Outlier detection
  – Super-resolution
  – Artifact removal
  – Compression

2 minute exercise
Talk to your friend next to you, and tell him or her everything you can about this data set:
[data matrix shown on the slide]
The rows are correlated – there is a latent representation.

Learning high dimensional distributions is hard
• Assume we work with small (32×32) images
• Each data point is a real vector of size 32 × 32 × 3
• Data occupies only a tiny fraction of R^(32×32×3)
• Impossible to directly fit a standard probability distribution! (See the back-of-the-envelope sketch below.)
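To make the last point concrete, here is a back-of-the-envelope sketch (my own numbers, not from the slides), assuming 8-bit RGB pixels, of how many parameters even simple "standard" distributions over 32×32×3 images would need:

```python
import math

# Back-of-the-envelope: why directly fitting p(x) over 32x32x3 images is hopeless.
d = 32 * 32 * 3  # dimensionality of one flattened image vector: 3072

# Even a full-covariance Gaussian needs d mean + d*(d+1)/2 covariance parameters.
gaussian_params = d + d * (d + 1) // 2
print(f"full-covariance Gaussian: {gaussian_params:,} parameters")      # ~4.7 million

# A fully general probability table over 8-bit pixels would need 256**d - 1 entries.
print(f"tabular distribution: ~10^{int(d * math.log10(256))} entries")  # ~10^7398
```

Even the Gaussian, the simplest standard choice, needs millions of parameters and still cannot capture the multi-modal structure of natural images; this motivates the divide-and-conquer factorization introduced next.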
Divide and conquer
What is easier:
1. Write a paragraph at once?
2. Guess the next word?
Mary had a little
Mary had a little dog.
Mary had a little dog. It
Mary had a little dog. It was
Mary had a little dog. It was cute

Chain rule
Decompose the probability of an n-dimensional data point into a product of n conditional probabilities:
p(x) = p(x_1, x_2, …, x_n)
     = p(x_1) p(x_2 | x_1) … p(x_n | x_1, x_2, …, x_{n-1})
     = ∏_t p(x_t | x_1, x_2, …, x_{t-1})

Learning p(x_t | x_1, …, x_{t-1})
The good: it looks like supervised learning!
The bad: we need to handle variable-sized input.

Large n -> sparse data
Consider an n-gram LM:
p(x_t | x_{t-k+1}, …, x_{t-1}) = #x_{t-k:t} / #x_{t-k:t-1}
As n → ∞, n-grams become unique:
– the probability of all but one continuation is 0
– we can't model long contexts
Solution: use a representation in which "llama ≈ animal".

Learning p(x_t | x_1, …, x_{t-1}) #1
Assume the distant past doesn't matter:
p(x_t | x_1, …, x_{t-1}) ≈ p(x_t | x_{t-k+1}, …, x_{t-1})
Quite popular:
• n-gram text models (estimate p by counting)
• FIR filters
How to efficiently handle a large k?

Neural LM
Compute p(x_t | x_{t-k+1}, …, x_{t-1}) with a neural net.
[diagram: sentence → table lookup of word representations → concatenation → extractor (an arbitrary sub-graph) → softmax]
Looks somewhat like a convolution. Still: a large "n" means many parameters – how to fix that?
Y. Bengio et al., "A Neural Probabilistic Language Model"

Dilated Convolutions
Large (but finite) receptive fields, few parameters.
WaveNet: https://arxiv.org/abs/1609.03499
ByteNet: https://arxiv.org/abs/1610.10099

Learning p(x_t | x_1, …, x_{t-1}) #2
Introduce a hidden (i.e. not directly observed) state h_t that summarizes x_1, …, x_{t-1}, computed with the recurrence:
h_t = f(h_{t-1}, x_t)
p(x_{t+1} | x_1, …, x_t) = g(h_t)

Recurrences are powerful
Input: a sequence of bits 1, 0, 1, 0, 0, 1
Output: the parity
Solution: the hidden state is just 1 bit – the parity so far:
h_0 = 0
h_t = XOR(h_{t-1}, x_t)
y_T = h_T
Constant operating memory! Works for arbitrarily long sequences!

Recurrent Neural Networks (RNNs)
h_t = f(h_{t-1}, x_t)
p(x_{t+1} | x_1, …, x_t) = g(h_t)
f and g are implemented as feedforward neural nets.

RNNs naturally handle sequences
Training: unroll and supervise at every step: p(had | Mary), p(a | …), p(little | …), … for the input "Mary had a …".
Generation: sample one element at a time, starting from <SOS>; x_1 is sampled from p(x_1), fed back in, and so on.

RNNs are dynamical systems
An unrolled RNN is a very deep network in which all "arrows" compute the same function f. This impacts training!

RNN behavior – intuitions
A linear dynamical system computes
h_t = w · h_{t-1} = w^t h_0
When:
1. w > 1 it diverges: lim_{t→∞} h_t = sign(h_0) · ∞
2. w = 1 it is constant: lim_{t→∞} h_t = h_0
3. −1 < w < 1 it decays to 0: lim_{t→∞} h_t = 0
4. w = −1 it flips between ±h_0
5. w < −1 it diverges, changing sign at each step

A multidimensional linear dyn. system
h_t ∈ R^n, W ∈ R^{n×n}
h_t = W h_{t-1}
Compute the eigendecomposition of W:
W = Q Λ Q^{-1} = Q diag(λ_11, …, λ_nn) Q^{-1}
Then
h_t = W h_{t-1} = W^t h_0 = Q Λ^t Q^{-1} h_0 = Q diag(λ_11^t, …, λ_nn^t) Q^{-1} h_0

A multidimensional dyn. system cont.
h_t = W h_{t-1} = W^t h_0 = Q Λ^t Q^{-1} h_0
If the largest eigenvalue has norm:
1. > 1, the system diverges
2. = 1, the system is stable or oscillates
3. < 1, the system decays to 0
This is similar to the scalar case (see the numerical sketch below).
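To see these regimes numerically, here is a minimal numpy sketch (not from the slides; the matrix and its eigenvalues are made up for illustration) that iterates h_t = W h_{t-1} for different spectral radii:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))        # a random orthogonal basis

for radius in (0.9, 1.0, 1.1):                      # magnitude of the largest eigenvalue
    W = Q @ np.diag([radius, 0.5, 0.3, 0.1]) @ Q.T  # spectral radius = radius
    h = np.ones(4)
    for _ in range(100):                            # iterate h_t = W h_{t-1}
        h = W @ h
    print(f"spectral radius {radius}: ||h_100|| = {np.linalg.norm(h):.3e}")

# The norm decays towards 0 for 0.9, stays bounded for 1.0, and blows up for 1.1.
```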
A nonlinear dynamical system
h_t = tanh(w · h_{t-1})
The output is bounded (it can't diverge!).
If w > 1 the system has 3 fixed points: 0 and ±h_f. Starting the iteration from h_0 ≠ 0, it ends at ±h_f (effectively remembering the sign of h_0).
If w ≤ 1 it has one fixed point: 0.
[plot: trajectories of h_t = tanh(w · h_{t-1}) for w > 1]

Gradients in RNNs
Recall the RNN equations:
h_t = f(h_{t-1}, x_t; θ)
y_t = g(h_t)
Assume supervision only at the last step: L = e(y_T). The gradient is
∂L/∂θ = Σ_t (∂L/∂h_T) (∂h_T/∂h_t) (∂h_t/∂θ)

Trouble with ∂h_T/∂h_t
∂h_T/∂h_t measures how much h_T changes when h_t changes.

Another look at ∂h_T/∂h_t
Let h_t = tanh(w · h_{t-1}). Then:
∂h_t/∂h_0 = (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) … (∂h_1/∂h_0)
∂h_{t+1}/∂h_t = tanh'(w h_t) · w
∂h_t/∂h_0 = ∏_{τ=1}^{t} tanh'(w h_{τ-1}) · w = w^t ∏_{τ=1}^{t} tanh'(w h_{τ-1})
If any factor tanh'(·) is 0, the whole product is 0. The backward phase is linear!

Vanishing gradient
If ∂h_T/∂h_t = 0, the network forgets all information from time t by the time it reaches time T. This makes it impossible to discover correlations between inputs at distant time steps. This is a modeling problem.

Exploding gradient
When ∂h_T/∂h_t is large, ∂Loss/∂θ will be large too. Making a gradient step θ ← θ − α · ∂Loss/∂θ can then drastically change the network, or even destroy it. This is not a problem of information flow, but of training stability!

Exploding gradient intuition #1
h_t = tanh(2 · h_{t-1} + b)
[plot: trajectories of h_t]

Exploding gradient intuition #2
h_0 = σ(0.5)
h_t = σ(w · h_{t-1} + b)
L = (h_50 − 0.7)^2
R. Pascanu et al., https://arxiv.org/abs/1211.5063

Summary: trouble with ∂h_T/∂h_t
RNNs are difficult to train because:
• ∂h_T/∂h_t can be 0 – vanishing gradient
• ∂h_T/∂h_t can be ∞ – exploding gradient
Both phenomena are governed by the spectral norm of the weight matrix (the norm of the largest eigenvalue; the magnitude of w in the scalar case).

Exploding gradient solution
Don't take large steps. Pick a gradient norm threshold and scale down all gradients with a larger norm. This prevents the model from making a huge learning update and destroying itself.

Vanishing gradient solution: LSTM, the RNN workhorse
∂h_T/∂h_t ≈ 0 means that step T forgets the information from step t.
The dynamical system h_t = w · h_{t-1} maximally preserves information when w = 1.
The LSTM introduces a memory cell c_t that can keep information forever:
c_t = c_{t-1}

Memory cell
The memory cell preserves information:
c_t = c_{t-1}
∂c_T/∂c_t = 1

Gates
[diagram: a gate σ(W_xi x_t + ⋯) multiplies tanh(W_xc x_t + ⋯) before it is added to c_t]
Gates selectively load information into the memory cell:
i_t = σ(W_xi x_t + ⋯)
c_t = c_{t-1} + i_t · tanh(W_xc x_t + ⋯)

LSTM: the details
The hidden state is a pair of:
– c_t: the information kept in the cell, hidden from the rest of the network
– h_t: the information extracted from the cell into the network
Update equations:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
c_t = i_t · tanh(W_xc x_t + W_hc h_{t-1} + b_c) + f_t · c_{t-1}
h_t = o_t · tanh(c_t)

RNNs summary: LSTM/GRU
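To make the update equations concrete, here is a minimal numpy sketch of a single LSTM step (not from the slides; the parameter names simply mirror the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the update equations above (illustrative sketch)."""
    (W_xi, W_hi, b_i,
     W_xf, W_hf, b_f,
     W_xo, W_ho, b_o,
     W_xc, W_hc, b_c) = params
    i_t = sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)   # input gate
    f_t = sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)   # forget gate
    o_t = sigmoid(W_xo @ x_t + W_ho @ h_prev + b_o)   # output gate
    c_t = i_t * np.tanh(W_xc @ x_t + W_hc @ h_prev + b_c) + f_t * c_prev
    h_t = o_t * np.tanh(c_t)                          # exposed hidden state
    return h_t, c_t
```

The additive path c_t = … + f_t · c_{t-1} is what keeps ∂c_T/∂c_t close to 1 while the forget gate stays open, which is exactly the mechanism the slides invoke against vanishing gradients.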